
duckdb_iceberg's Introduction

DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on using DuckDB, please refer to the DuckDB documentation.

Installation

If you want to install and use DuckDB, please see our website for installation and usage instructions.

Data Import

For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:

SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';

Refer to our Data Import section for more information.

SQL Reference

The website contains a reference of functions and SQL constructs available in DuckDB.

Development

For development, DuckDB requires CMake, Python3 and a C++11 compliant compiler. Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version. You should run make unit and make allunit to verify that your version works properly after making changes. To test performance, you can run BUILD_BENCHMARK=1 BUILD_TPCH=1 make and then perform several standard benchmarks from the root directory by executing ./build/release/benchmark/benchmark_runner. The details of benchmarks are in our Benchmark Guide.

Please also refer to our Build Guide and Contribution Guide.

Support

See the Support Options page.

duckdb_iceberg's People

Contributors

carlopi, dpgrev, fokko, mytherin, samansmink, szarnyasg, tishj

duckdb_iceberg's Issues

Correct field projection not working

In Iceberg, columns are projected using field-IDs. Even if a column is renamed (Iceberg is lazy, so existing data files are not rewritten), the scan should still map the column to the new name correctly.

Example:

from pyiceberg.catalog.sql import SqlCatalog

warehouse = "/tmp/warehouse"  # any writable local directory for the catalog warehouse
catalog = SqlCatalog("test_sql_catalog", uri="sqlite:///:memory:", warehouse=f"file://{warehouse}")
catalog.create_namespace("default")

import pyarrow as pa

pa_schema = pa.schema([
    pa.field('animal', pa.string(), nullable=False),
    pa.field('n_legs', pa.int32(), nullable=False),
])

# Empty table
tbl = catalog.create_table("default.test_id_projection", schema=pa_schema)

# Write some data
tbl.append(
    df=pa.Table.from_arrays([pa.array(["Flamingo", "Horse"]), pa.array([2, 4])], schema=pa_schema),
)

# Rename the column
with tbl.update_schema() as upd:
    upd.rename_column("n_legs", "number_of_legs")

# Write more data
tbl.append(
    df=pa.Table.from_arrays(
        [pa.array(["Brittle stars", "Centipede"]), pa.array([5, 100])],
        schema=pa.schema([
            pa.field('animal', pa.string(), nullable=False),
            pa.field('number_of_legs', pa.int32(), nullable=False),
        ]),
    ),
)

location = tbl.metadata_location

import duckdb

duckdb.sql('INSTALL iceberg; LOAD iceberg;')
result = duckdb.sql(
    f"""
SELECT *
FROM iceberg_scan('{location}')
"""
).fetchall()

assert result == [
    ('Flamingo', 2),
    ('Horse', 4),
    ('Brittle stars', 5),
    ('Centipede', 100),
]

This fails with:

>       ).fetchall()
E       duckdb.duckdb.IOException: IO Error: Failed to read file "//private/var/folders/22/yb9h2zd55ql37h4_50xkmg7r0000gn/T/pytest-of-fokkodriesprong/pytest-8/test_sql0/default.db/test_id_projection/data/00000-0-d24c398f-f93f-420c-82d4-2a66b72c580e.parquet": schema mismatch in glob: column "number_of_legs" was read from the original file "//private/var/folders/22/yb9h2zd55ql37h4_50xkmg7r0000gn/T/pytest-of-fokkodriesprong/pytest-8/test_sql0/default.db/test_id_projection/data/00000-0-f2b7f973-f7df-4cd6-aa7a-78176b3c3d7f.parquet", but could not be found in file "//private/var/folders/22/yb9h2zd55ql37h4_50xkmg7r0000gn/T/pytest-of-fokkodriesprong/pytest-8/test_sql0/default.db/test_id_projection/data/00000-0-d24c398f-f93f-420c-82d4-2a66b72c580e.parquet".
E       Candidate names: animal, n_legs
E       If you are trying to read files with different schemas, try setting union_by_name=True

Could not read iceberg table from s3 "...metadata/version-hint.text": 404 (Not Found)

What happens?

Can't read an Iceberg table created with Trino on S3.

D SELECT count(*) FROM iceberg_scan('s3://test/iceberg_p2/orders');
Error: HTTP Error: Unable to connect to URL "https://test.gateway.storjshare.io/iceberg_p2/orders/metadata/version-hint.text": 404 (Not Found)

Trying the same way, but on a local drive:

D SELECT count(*) FROM iceberg_scan('/Users/yuriygavrilov/Downloads/orders', ALLOW_MOVED_PATHS=true);
Error: IO Error: Cannot open file "/Users/yuriygavrilov/Downloads/orders/metadata/version-hint.text": No such file or directory

The problem seems to be the metadata/version-hint.text file, but why is it mandatory? Trino does not require it.
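
As a possible workaround (a sketch only, not verified against this exact table), iceberg_scan can also be pointed directly at a specific metadata JSON file instead of relying on version-hint.text, as other issues below do:

import duckdb

duckdb.sql("INSTALL iceberg; LOAD iceberg;")
# Hypothetical file name: substitute the latest *.metadata.json of the orders table.
duckdb.sql("""
    SELECT count(*)
    FROM iceberg_scan('s3://test/iceberg_p2/orders/metadata/00001-example.metadata.json')
""").show()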

To Reproduce

Try to open the attached table with the Iceberg extension:
orders.zip

OS:

14.0 (23A344)

DuckDB Version:

v0.9.1

DuckDB Client:

cli

Full Name:

Yuriy Gavrilov

Affiliation:

S7 Airlines

Have you tried this on the latest main branch?

I have not tested with any build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

0.10.0 Regression - Cannot parse metadata

Starting with DuckDB 0.10.0, I can no longer use the extension to scan Iceberg tables:

select * from iceberg_scan('s3://zones/iceberg/test/zones_43_1b7d1e67-8f53-42ab-ba55-a66f438f344f/metadata/00006-0d6ca258-8039-446f-843a-69e21344f272.metadata.json');

Error: IO Error: Invalid field found while parsing field: required

The above works fine with DuckDB 0.9.2

I attached the metadata file, which was created by Iceberg.
00006-0d6ca258-8039-446f-843a-69e21344f272.metadata.json

Table schema evolution support

First of all, thank you for this useful extension.
I can use the extension to read iceberg tables just fine.
As soon as the schema changes, however, the extension throws an error.

Use case:

Using Trino to manage Iceberg on AWS/Glue/S3, I issued the following:

create table test(a int);
insert into test values(1);
alter table test add column b int;
insert into test values(2,5);

select * from test;
a | b
---+------
1 | NULL
2 | 5

Using the last metadata file, I issued the following in DuckDB (after loading the aws and iceberg extensions)

select * from ICEBERG_SCAN('s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/metadata/00005-5a37e1af-dbf3-48c7-b7c7-11309ecc6279.metadata.json');

Error: IO Error: Failed to read file "s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/data/20231204_085221_00020_8b7ws-30eb76d3-2f3e-4a01-a030-976f7640d26a.parquet": schema mismatch in glob: column "b" was read from the original file "s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/data/20231204_085314_00023_8b7ws-1e49b4ae-9d23-469b-8e00-6dd48da1d0a0.parquet", but could not be found in file "s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/data/20231204_085221_00020_8b7ws-30eb76d3-2f3e-4a01-a030-976f7640d26a.parquet".
Candidate names: a
If you are trying to read files with different schemas, try setting union_by_name=True

As I mentioned above, reading the data before the schema change worked just fine.

Thank you,
Harel

How to build iceberg extension

I tried to compile the iceberg extension following the steps in the README, but I ran into the following problem:

mkdir -p build/release && \
cmake   -DDUCKDB_EXTENSION_NAMES="iceberg" -DDUCKDB_EXTENSION_ICEBERG_PATH="/home/gpadmin/duckdb_iceberg/" -DDUCKDB_EXTENSION_ICEBERG_SHOULD_LINK=1 -DDUCKDB_EXTENSION_ICEBERG_INCLUDE_PATH="/home/gpadmin/duckdb_iceberg/src/include"  -DEXTENSION_STATIC_BUILD=1 -DCMAKE_BUILD_TYPE=Release -DEXTENSION_STATIC_BUILD=1 -DBUILD_EXTENSIONS="httpfs" -DOSX_BUILD_ARCH=   -DVCPKG_MANIFEST_DIR='/home/gpadmin/duckdb_iceberg/' -DVCPKG_BUILD=1 -DCMAKE_TOOLCHAIN_FILE='/home/gpadmin/vcpkg/scripts/buildsystems/vcpkg.cmake' -S ./duckdb/ -B build/release && \
cmake --build build/release --config Release
-- Running vcpkg install
warning: Embedding `vcpkg-configuration` in a manifest file is an EXPERIMENTAL feature.
Detecting compiler hash for triplet x64-linux...
The following packages will be built and installed:
    avro-cpp[core,snappy]:x64-linux -> 2022-11-07 -- /home/gpadmin/duckdb_iceberg/./vcpkg_ports/avro-cpp
  * boost-algorithm:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-algorithm/6a184d48bcf1a89a12f66255625340ce9cbfda92
  * boost-align:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-align/6dcfc6f69223f8fffff4c82db31be7495f17062d
  * boost-any:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-any/798ddb57d8d3bb7a77e424cd5d60f88cc4e6bee7
  * boost-array:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-array/8503faaf38485a913d4daf3efd2076b67bebdc0b
  * boost-assert:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-assert/f07d2980139502309087821ca9a78a9a43a766ec
  * boost-atomic:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-atomic/ed19c1a3b7f57d00917b260e35cd662b28c1f6eb
  * boost-bind:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-bind/1d9372948d3c0fe1da8965e76ded151513f16a52
  * boost-build:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-build/9d4a35930b7414144551dc9c2e00b31bcb0ca8b2
  * boost-concept-check:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-concept-check/c266d37e02d494dcf4c3623bae07775c2a7baa45
  * boost-config:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-config/af2f41ef14357f44466182d783e8cb218d677b1f
  * boost-container:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-container/54ac399bdf18e9c6ab8c9df4d81df4bd86a547d9
  * boost-container-hash:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-container-hash/c5bde693ed8971a4e0ba6e22854386d4054898a8
  * boost-conversion:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-conversion/e99972dc970d3a2229996b240d5dc4362204b6dc
  * boost-core:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-core/f406b6bfcf32cb14d17d1df5e57a1162a2f2d703
  * boost-crc:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-crc/7146ce31ceb0cdce252cdf556502fbd358eb30d5
  * boost-describe:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-describe/631c0781e5719b44d906954a57f3a9a42bb37e8a
  * boost-detail:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-detail/ea40bd02426af4c5ce99995762c452e3a70b5189
  * boost-dynamic-bitset:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-dynamic-bitset/d350f9d70da182a7abddd0cadc649394fb144113
  * boost-exception:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-exception/ef4cfcde16100435562ccebf90edca8cfc4e204c
  * boost-filesystem:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-filesystem/9e51e733f0ddb9e287a6ee03bf9cbb1f3f65f081
  * boost-format:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-format/9562adeb23d81f0bfb89adb5c920ead09b302ec5
  * boost-function:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-function/dce4e73b054f0b611a8d1e0f9efc9dd8cd79071f
  * boost-function-types:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-function-types/1da109d6c53f82fdac9bdf6efa17af3d47dacb54
  * boost-functional:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-functional/3c428ee1b1238a5fbf337fd549d3b3c3afc17930
  * boost-fusion:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-fusion/0c32122b9244e041424874f6657451169ab9522d
  * boost-integer:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-integer/ed95b549c8719f65d4599fb6a612db94043c033f
  * boost-intrusive:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-intrusive/c6ef0a6523433599c5a6e0bebedd4b2caa6b56cf
  * boost-io:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-io/21da7534c21a1dfd1ff5833e330837ad8fa26341
  * boost-iostreams[bzip2,core,lzma,zlib,zstd]:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-iostreams/9a39f5835d29b23ad73b036797cb4f6f000e83c2
  * boost-iterator:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-iterator/06e57442234e51fd436cc31ac285319baaf83e26
  * boost-lexical-cast:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-lexical-cast/54b82af182eb01789c463148630e7e897ae3f3b7
  * boost-math:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-math/61b0ddfd8a1a5e18f97584b57143de200770bd56
  * boost-modular-build-helper:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-modular-build-helper/5c1a238cc20c885e785bc05a36b974deb5e02db7
  * boost-move:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-move/1ce5f627ac0c30d041b9f2981fd3c11e4b504825
  * boost-mp11:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-mp11/80e104d697b7ff7a283c5f7ee83ae25fe1192bd7
  * boost-mpl:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-mpl/e08d5a2eb5d5edfc5a62469dbaa283cf56570d07
  * boost-numeric-conversion:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-numeric-conversion/97842146c6a87912141dd1d4654e1576d12e5cef
  * boost-optional:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-optional/100ff3094e97d23516718679d59d9f86821074b8
  * boost-predef:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-predef/3a0c5194d7a4d8825efff8d67e0c75cae0b058d7
  * boost-preprocessor:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-preprocessor/85819cdd004323b0afe1a361b9e05fdb55ab6838
  * boost-program-options:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-program-options/c39f818a3734e996eb777117eb8d1492d7b1b26f
  * boost-random:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-random/0c97951b0f38d6c9b469b367ac136249e1fa23fd
  * boost-range:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-range/fc4c6c85df6efb4afe27a53499abdcd17da71a69
  * boost-regex:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-regex/d31af2d45fe496fd2d89b734027e72c92d23f1d1
  * boost-smart-ptr:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-smart-ptr/a61a6534966bd4a70b051538a287de38be4345ba
  * boost-static-assert:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-static-assert/e6e55a999a301eb62d7c4da853a6aeb296713943
  * boost-system:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-system/48fb125e52a85de87f4ff03475742d79f84b42cb
  * boost-throw-exception:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-throw-exception/c51c9fa76f925c5824688c6ca9f451716373b5d5
  * boost-tokenizer:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-tokenizer/4a53c730b20cacfbd8e6bbbd8faaeee0720c9fc1
  * boost-tuple:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-tuple/c9173d5394c4b31fdf698b156b542c29031d03fe
  * boost-type-index:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-type-index/d02bcbc9c4ee9a4b27e90ab03010bea7b6c70eb3
  * boost-type-traits:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-type-traits/34773fec432e120d3f2d9dd482461da7c36ef897
  * boost-typeof:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-typeof/ab6e440573d5f3ac964c50b3efc24055500a772f
  * boost-uninstall:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-uninstall/d96807810fa612b038f9268c8df545d421fda1a6
  * boost-unordered:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-unordered/ee2a3d19df6c91036cb772b4747bc7244307cda1
  * boost-utility:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-utility/b94486558c58f29bb5165d952196d448719caae5
  * boost-variant2:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-variant2/3acfc977da209b860fd7cf1365180af919be0eeb
  * boost-vcpkg-helpers:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-vcpkg-helpers/f99477ea8dcfb463d2f0a2cce3c60914c2e5adb4
  * boost-winapi:x64-linux -> 1.82.0#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-winapi/db66242e796c29abd3622cf41d506bee622b4d10
  * bzip2[core,tool]:x64-linux -> 1.0.8#4 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/bzip2/6165360d15e6de08dff3a5f079d51e69908cc55d
  * liblzma:x64-linux -> 5.4.1#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/liblzma/9a3ebbf75e1c3adb76d7e71717c1f03bb291036a
    openssl:x64-linux -> 3.0.8 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/openssl/9cd36489fc65db87c4cd94ac606abd05c1db8f2d
  * snappy:x64-linux -> 1.1.9#5 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/snappy/a97f6aac039dc2b5e6fdac753d66b94cfb408d30
  * vcpkg-cmake:x64-linux -> 2022-12-22 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/vcpkg-cmake/1913f86defd2140d0a6751be2d51952e4090efa4
  * vcpkg-cmake-config:x64-linux -> 2022-02-06#1 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/vcpkg-cmake-config/8d54cc4f487d51b655abec5f9c9c3f86ca83311f
  * vcpkg-cmake-get-vars:x64-linux -> 2023-03-02 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/vcpkg-cmake-get-vars/2e624c2cf12a97a7a802e31ff1d28b9fa6ba9bde
  * zlib:x64-linux -> 1.2.13 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/zlib/ad5a49006f73b45b715299515f31164131b51982
  * zstd:x64-linux -> 1.5.4#2 -- /home/gpadmin/vcpkg/buildtrees/versioning_/versions/zstd/fcc4fe2738d903bc347d20d2d4c836699e273509
Additional packages (*) will be modified to complete this operation.
Restored 24 package(s) from /home/gpadmin/.cache/vcpkg/archives in 55.1 ms. Use --debug to see more details.
Installing 1/69 vcpkg-cmake-config:x64-linux...
Elapsed time to handle vcpkg-cmake-config:x64-linux: 333 us
Installing 2/69 vcpkg-cmake:x64-linux...
Elapsed time to handle vcpkg-cmake:x64-linux: 249 us
Installing 3/69 snappy:x64-linux...
Elapsed time to handle snappy:x64-linux: 484 us
Installing 4/69 zlib:x64-linux...
Elapsed time to handle zlib:x64-linux: 369 us
Installing 5/69 boost-uninstall:x64-linux...
Elapsed time to handle boost-uninstall:x64-linux: 262 us
Installing 6/69 boost-vcpkg-helpers:x64-linux...
Elapsed time to handle boost-vcpkg-helpers:x64-linux: 283 us
Installing 7/69 boost-config:x64-linux...
Elapsed time to handle boost-config:x64-linux: 2.84 ms
Installing 8/69 boost-static-assert:x64-linux...
Elapsed time to handle boost-static-assert:x64-linux: 395 us
Installing 9/69 boost-type-traits:x64-linux...
Elapsed time to handle boost-type-traits:x64-linux: 3.48 ms
Installing 10/69 boost-assert:x64-linux...
Elapsed time to handle boost-assert:x64-linux: 580 us
Installing 11/69 boost-throw-exception:x64-linux...
Elapsed time to handle boost-throw-exception:x64-linux: 526 us
Installing 12/69 boost-core:x64-linux...
Elapsed time to handle boost-core:x64-linux: 1.39 ms
Installing 13/69 boost-tuple:x64-linux...
Elapsed time to handle boost-tuple:x64-linux: 493 us
Installing 14/69 boost-preprocessor:x64-linux...
Elapsed time to handle boost-preprocessor:x64-linux: 4.41 ms
Installing 15/69 boost-io:x64-linux...
Elapsed time to handle boost-io:x64-linux: 667 us
Installing 16/69 boost-utility:x64-linux...
Elapsed time to handle boost-utility:x64-linux: 837 us
Installing 17/69 boost-predef:x64-linux...
Elapsed time to handle boost-predef:x64-linux: 2.06 ms
Installing 18/69 boost-winapi:x64-linux...
Elapsed time to handle boost-winapi:x64-linux: 1.56 ms
Installing 19/69 boost-mp11:x64-linux...
Elapsed time to handle boost-mp11:x64-linux: 863 us
Installing 20/69 boost-variant2:x64-linux...
Elapsed time to handle boost-variant2:x64-linux: 668 us
Installing 21/69 vcpkg-cmake-get-vars:x64-linux...
Elapsed time to handle vcpkg-cmake-get-vars:x64-linux: 685 us
Installing 22/69 boost-modular-build-helper:x64-linux...
Elapsed time to handle boost-modular-build-helper:x64-linux: 704 us
Installing 23/69 boost-build:x64-linux...
Elapsed time to handle boost-build:x64-linux: 9.99 ms
Installing 24/69 boost-system:x64-linux...
Building boost-system:x64-linux...
-- Installing port from location: /home/gpadmin/vcpkg/buildtrees/versioning_/versions/boost-system/48fb125e52a85de87f4ff03475742d79f84b42cb
-- Downloading https://github.com/boostorg/system/archive/boost-1.82.0.tar.gz -> boostorg-system-boost-1.82.0.tar.gz...
-- Cleaning sources at /home/gpadmin/vcpkg/buildtrees/boost-system/src/ost-1.82.0-6fabddb1d7.clean. Use --editable to skip cleaning for the packages you specify.
-- Extracting source /home/gpadmin/vcpkg/downloads/boostorg-system-boost-1.82.0.tar.gz
-- Using source at /home/gpadmin/vcpkg/buildtrees/boost-system/src/ost-1.82.0-6fabddb1d7.clean
-- Getting CMake variables for x64-linux
-- Configuring x64-linux
-- Building x64-linux-dbg
CMake Error at scripts/cmake/vcpkg_execute_build_process.cmake:134 (message):
    Command failed: /home/gpadmin/vcpkg/downloads/tools/cmake-3.27.1-linux/cmake-3.27.1-linux-x86_64/bin/cmake --build . --config Debug --target install -- -v -j17
    Working Directory: /home/gpadmin/vcpkg/buildtrees/boost-system/x64-linux-dbg
    See logs for more information:
      /home/gpadmin/vcpkg/buildtrees/boost-system/install-x64-linux-dbg-out.log

Call Stack (most recent call first):
  /home/gpadmin/duckdb_iceberg/build/release/vcpkg_installed/x64-linux/share/vcpkg-cmake/vcpkg_cmake_build.cmake:74 (vcpkg_execute_build_process)
  /home/gpadmin/duckdb_iceberg/build/release/vcpkg_installed/x64-linux/share/vcpkg-cmake/vcpkg_cmake_install.cmake:16 (vcpkg_cmake_build)
  /home/gpadmin/duckdb_iceberg/build/release/vcpkg_installed/x64-linux/share/boost-build/boost-modular-build.cmake:140 (vcpkg_cmake_install)
  buildtrees/versioning_/versions/boost-system/48fb125e52a85de87f4ff03475742d79f84b42cb/portfile.cmake:14 (boost_modular_build)
  scripts/ports.cmake:168 (include)


error: building boost-system:x64-linux failed with: BUILD_FAILED
Elapsed time to handle boost-system:x64-linux: 3.1 s
Please ensure you're using the latest port files with `git pull` and `vcpkg update`.
Then check for known issues at:
  https://github.com/microsoft/vcpkg/issues?q=is%3Aissue+is%3Aopen+in%3Atitle+boost-system
You can submit a new issue at:
  https://github.com/microsoft/vcpkg/issues/new?title=[boost-system]+Build+error&body=Copy+issue+body+from+%2Fhome%2Fgpadmin%2Fduckdb_iceberg%2Fbuild%2Frelease%2Fvcpkg_installed%2Fvcpkg%2Fissue_body.md

-- Running vcpkg install - failed
CMake Error at /home/gpadmin/vcpkg/scripts/buildsystems/vcpkg.cmake:899 (message):
  vcpkg install failed.  See logs for more information:
  /home/gpadmin/duckdb_iceberg/build/release/vcpkg-manifest-install.log
Call Stack (most recent call first):
  /usr/local/share/cmake-3.18/Modules/CMakeDetermineSystem.cmake:93 (include)
  CMakeLists.txt:19 (project)


CMake Error: CMake was unable to find a build program corresponding to "Unix Makefiles".  CMAKE_MAKE_PROGRAM is not set.  You probably need to select a different build tool.
CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!
make: *** [Makefile:52: release] Error 1

Support AWS Glue Catalog

I'd like to add support for "schemes" that prefix the location supplied for Iceberg tables.

For example I'd like to support AWS Glue with a scheme like:

iceberg_scan('aws-glue://[catalog id].[database_name].[table_name]').

The use of this scheme would allow DuckDB to access Iceberg tables where the current snapshot version is stored in a Glue data catalog. The Iceberg extension would call Glue's describe_table() API method, get the current metadata_url and parse the JSON file.

I'd also like to add a companion function called iceberg_scan_tables('aws-glue://[catalog id]') that would scan all databases and tables under that catalog and create views backed by iceberg_scan(), exposing the Glue databases and tables to the end user.
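
For context, a rough sketch of how that Glue lookup can already be done from Python today (not part of the extension; it assumes boto3 is configured and that the Glue entry stores the current metadata pointer in the metadata_location table parameter, as the Iceberg Glue catalog does):

import boto3
import duckdb

# Hypothetical identifiers; substitute your own Glue database and table names.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

# The Iceberg Glue catalog keeps a pointer to the current metadata file here.
metadata_location = table["Parameters"]["metadata_location"]

duckdb.sql("INSTALL iceberg; LOAD iceberg;")
duckdb.sql(f"SELECT count(*) FROM iceberg_scan('{metadata_location}')").show()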

binder error when directly scanning metadata file

Hey,

I am encountering an issue while trying to scan metadata files directly from an Iceberg table located in S3.

Here are the details:

  • Data Source: Iceberg table in S3, written by Flink with AWS Glue Catalog.
  • DuckDB Version: v0.9.1

Query

select * from  iceberg_scan('s3://<...>/metadata/00136-4330ee25-0db5-42a4-8230-035d739f76e7.metadata.json') limit 1;

Error:

Binder Error: Table "iceberg_scan_deletes" does not have a column named "file_path"

Would appreciate any insights or suggestions on what could be causing this issue. Thank you!

IOException when querying table with a list<int> column.

Hello, thanks for your great work on duckdb-iceberg!

I'm encountering an issue where attempting to query an iceberg table with iceberg_scan fails if the table has a column which is a list of ints (list<int>).

table_with_lists # a pyiceberg.table.Table object
table_without_lists # another pyiceberg.table.Table object

duckdb.sql("INSTALL ICEBERG; LOAD ICEBERG;")

# querying the table that doesn't feature any lists works fine
duckdb.sql(f"SELECT * FROM iceberg_scan('{table_without_lists.metadata_location}') LIMIT 10")

# querying the table that *does* have lists throws an error
duckdb.sql(f"SELECT * FROM iceberg_scan('{table_with_lists.metadata_location}') LIMIT 10")
# IOException: IO Error: Invalid field found while parsing field: type

This seems to be related to an error in parsing the table schema. I wondered whether the duckdb-iceberg extension simply had no logic for complex types, but it appears that it does.
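
A minimal, self-contained repro sketch (untested; it assumes a local SqlCatalog like the field-projection example earlier in this list):

import duckdb
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

warehouse = "/tmp/warehouse"  # any writable local directory
catalog = SqlCatalog("test", uri="sqlite:///:memory:", warehouse=f"file://{warehouse}")
catalog.create_namespace("default")

pa_schema = pa.schema([
    pa.field("id", pa.int32(), nullable=False),
    pa.field("values", pa.list_(pa.int32())),  # the list<int> column that triggers the error
])
table_with_lists = catalog.create_table("default.test_lists", schema=pa_schema)
table_with_lists.append(pa.Table.from_pylist([{"id": 1, "values": [1, 2, 3]}], schema=pa_schema))

duckdb.sql("INSTALL iceberg; LOAD iceberg;")
# Expected to raise the IOException described above.
duckdb.sql(f"SELECT * FROM iceberg_scan('{table_with_lists.metadata_location}') LIMIT 10")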

Hidden partition pruning

Iceberg has support for hidden partitioning. Data written to a partitioned table is split up based on the partition column and the applied transform.

A partitioned table can tremendously speed up queries, since each manifest list keeps a summary of the partition ranges. This way the files can be pruned at a high level.

It is important to note that partition specs can evolve over time in Iceberg. For example, when a table is changed from daily to hourly partitions due to increased data volume, new data will be written using the new partitioning strategy. The old data will not change and remains partitioned on a daily basis until it is rewritten.

Pruning of manifest files

Manifest list:

{
	"manifest_path": "s3://warehouse/nyc/taxis/metadata/65085354-ca41-4ae0-b454-04bc1a669f19-m0.avro",
	"manifest_length": 16797,
	"partition_spec_id": 0,  # field-id: 502
	"content": 0,
	"sequence_number": 3,
	"min_sequence_number": 3,
	"added_snapshot_id": 864191924052698687,
	"added_data_files_count": 42,
	"existing_data_files_count": 0,
	"deleted_data_files_count": 0,
	"added_rows_count": 3627882,
	"existing_rows_count": 0,
	"deleted_rows_count": 0,
	"partitions": {  # field-id: 507
		"array": [{
			"contains_null": false,
			"contains_nan": {
				"boolean": false
			},
			"lower_bound": {
				"bytes": "¤7\u0000\u0000"
			},
			"upper_bound": {
				"bytes": "·J\u0000\u0000"
			}
		}]
	}
}

Since there can be many different partition strategies over time, the evaluators are constructed lazily in PyIceberg. Based on the partition_spec_id (field-id 502) we construct an evaluator:

        manifest_evaluators: Dict[int, Callable[[ManifestFile], bool]] = KeyDefaultDict(self._build_manifest_evaluator)

This looks up the partition spec, and creates the evaluator:

    def _build_partition_projection(self, spec_id: int) -> BooleanExpression:
        project = inclusive_projection(self.table.schema(), self.table.specs()[spec_id])
        return project(self.row_filter)

This requires the original schema, since the filter references the original fields. Consider the following schema:

schema {
  1: id int
}

For example, take the filter id = 10 with bucket(id) partitioning on the table. Then the spec looks as follows:

[
     1000: id: bucket[22](1)
]

The PartitionSpec has a single field, that starts from field-id 1000. It references field-id 1, which corresponds with id. This way we decouple the filtering from the naming.
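
For illustration, such a spec could be written out in PyIceberg roughly like this (a sketch using the field-ids and bucket count from the example above):

from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.transforms import BucketTransform

# Partition field 1000 buckets source field-id 1 ("id") into 22 buckets.
spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1000, transform=BucketTransform(num_buckets=22), name="id_bucket"),
)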

The most important part where it is filtered:

class InclusiveProjection(ProjectionEvaluator):
    def visit_bound_predicate(self, predicate: BoundPredicate[Any]) -> BooleanExpression:
        parts = self.spec.fields_by_source_id(predicate.term.ref().field.field_id)

        result: BooleanExpression = AlwaysTrue()
        for part in parts:
            # consider (d = 2019-01-01) with bucket(7, d) and bucket(5, d)
            # projections: b1 = bucket(7, '2019-01-01') = 5, b2 = bucket(5, '2019-01-01') = 0
            # any value where b1 != 5 or any value where b2 != 0 cannot be the '2019-01-01'
            #
            # similarly, if partitioning by day(ts) and hour(ts), the more restrictive
            # projection should be used. ts = 2019-01-01T01:00:00 produces day=2019-01-01 and
            # hour=2019-01-01-01. the value will be in 2019-01-01-01 and not in 2019-01-01-02.
            incl_projection = part.transform.project(name=part.name, pred=predicate)
            if incl_projection is not None:
                result = And(result, incl_projection)

        return result

Here we start with AlwaysTrue(), so a manifest is included until there is evidence that it is not relevant. The projection is a callable that is then applied to the row_filter:

    def project(self, expr: BooleanExpression) -> BooleanExpression:
        #  projections assume that there are no NOT nodes in the expression tree. to ensure that this
        #  is the case, the expression is rewritten to push all NOT nodes down to the expression
        #  leaf nodes.
        #  this is necessary to ensure that the default expression returned when a predicate can't be
        #  projected is correct.
        return visit(bind(self.schema, rewrite_not(expr), self.case_sensitive), self)

Pruning of the manifests

Manifest file:

When the manifest is considered relevant based on the partition range:

[{
  "status": 1,
  "snapshot_id": {
    "long": 4554453935356538000
  },
  "data_file": {
    "file_path": "s3://warehouse/nyc/taxis/data/tpep_pickup_datetime_day=2021-04-29/00003-10-989122fb-e6e6-449e-8290-57871f57028f-00001.parquet",
    "file_format": "PARQUET",
    "partition": {
      "tpep_pickup_datetime_day": {
        "int": 18746 // Partition: Thursday, April 29, 2021
      }
    },
...
}

This re-uses the same partition projection:

    @cached_property
    def partition_filters(self) -> KeyDefaultDict[int, BooleanExpression]:
        return KeyDefaultDict(self._build_partition_projection)

And the meat is in this function:

    def _build_partition_evaluator(self, spec_id: int) -> Callable[[DataFile], bool]:
        spec = self.table.specs()[spec_id]
        partition_type = spec.partition_type(self.table.schema())
        partition_schema = Schema(*partition_type.fields)
        partition_expr = self.partition_filters[spec_id]

        evaluator = visitors.expression_evaluator(partition_schema, partition_expr, self.case_sensitive)
        return lambda data_file: evaluator(data_file.partition)

We create a schema from the partition spec, build a struct type from the filter (struct<int>), and create an evaluator on top of it to check whether a data file's partition values can match, based on the partition specification.

I hope this helps. I'm happy to provide more context, and when I get the time I'll see if I can brush off my C++ skills.

Can't get AWS Athena Iceberg metadata because there's no `metadata/version-hint.text` file

I went ahead and gave this still-very-early extension a try, using my own Iceberg data set that I built with AWS Athena. No dice, I get this:

D SELECT *  FROM iceberg_metadata('s3://my-bucket/path/to/table/folder');
Error: Invalid Error: HTTP Error: Unable to connect to URL "https://my-bucket.s3.amazonaws.com/path/to/table/folder/metadata/version-hint.text": 404 (Not Found)

Looking at the current source code I see that it comes from here.

Searching the web a bit for relevant info I find this passage in this article:

Iceberg catalog

Within the catalog, there is a reference or pointer for each table to that table’s current metadata file. For example, in the diagram shown above, there are 2 metadata files. The value for the table’s current metadata pointer in the catalog is the location of the metadata file on the right.

What this data looks like is dependent on what Iceberg catalog is being used. A few examples:

  • With HDFS as the catalog, there’s a file called version-hint.text in the table’s metadata folder whose contents is the version number of the current metadata file.
  • With Hive metastore as the catalog, the table entry in the metastore has a table property which stores the location of the current metadata file.
  • With Nessie as the catalog, Nessie stores the location of the current metadata file for the table.

Obviously Athena is somehow storing this metadata in the AWS Glue Catalog (perhaps accessible with the same protocol as a Hive metastore?), and the extension doesn't integrate that yet.

I do realize this is an extremely early stage for the extension; just holler in this ticket when you think it might work (or if you are curious whether it might) and I'm happy to test this out some other time.

Predicate Pushdown for scans

It currently appears that predicate pushdown isn't applied before calculating what Parquet files are included in scanning the table's data.

The Avro manifest files include statistics about each Parquet file. Those should be leveraged to reduce the number of files that need to be read. Additionally, there is the byte offset for each row group in each Parquet file; that can avoid the need to read the Parquet footer and allow better concurrent scanning.
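
For comparison, PyIceberg already applies this kind of manifest-based pruning. A rough sketch (assuming tbl is a pyiceberg.table.Table and that a hypothetical column ts exists) of how a row filter narrows the set of data files before any Parquet is opened:

# The filter is evaluated against manifest statistics and partition values,
# so only data files that can possibly match are planned.
tasks = tbl.scan(row_filter="ts >= '2021-04-29T00:00:00'").plan_files()
for task in tasks:
    print(task.file.file_path)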

Can't use the extension if my data catalog did not create a version-hint.text file

My S3 bucket with an Iceberg table (screenshot below) cannot be queried with

iceberg_scan('s3://bucket/iceberg', ALLOW_MOVED_PATHS=true)

nor

iceberg_scan('s3://bucket/iceberg/*', ALLOW_MOVED_PATHS=true)

In particular the system is trying to find a very specific file (so the * pattern gets ignored):

duckdb.duckdb.Error: Invalid Error: HTTP Error: Unable to connect to URL https://bucket.s3.amazonaws.com/iceberg/metadata/version-hint.text

Unfortunately that file does not exist in my iceberg/ folder, nor in any of the iceberg/sub/metadata folders. Compared to the example data zip in the DuckDB docs about Iceberg, it is clear my Iceberg tables are missing that file, which the current implementation depends on.

That said, version-hint.text seems like something we do not really need: that information could default to a version, or perhaps be supplied as an additional parameter, instead of failing when the file is not found?

The original discussion with @Alex-Monahan in the dbt Slack is here; note that I was originally pointed to this as a possible cause, so perhaps reading a table that is formally Iceberg is not really independent of the data catalog it belongs to?

(screenshot of the S3 bucket structure omitted)

Compiler warnings

/__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp: In function ‘std::string duckdb::IcebergManifestContentTypeToString(duckdb::IcebergManifestContentType)’:
Warning: /__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp:30:1: warning: control reaches end of non-void function [-Wreturn-type]
   30 | }
      | ^
/__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp: In function ‘std::string duckdb::IcebergManifestEntryStatusTypeToString(duckdb::IcebergManifestEntryStatusType)’:
Warning: /__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp:43:1: warning: control reaches end of non-void function [-Wreturn-type]
   43 | }
      | ^
/__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp: In function ‘std::string duckdb::IcebergManifestEntryContentTypeToString(duckdb::IcebergManifestEntryContentType)’:
Warning: /__w/duckdb/duckdb/build/release/_deps/iceberg_extension_fc-src/src/include/iceberg_types.hpp:56:1: warning: control reaches end of non-void function [-Wreturn-type]
   56 | }

We probably want to add a default: throw InternalException(...) case to silence these warnings.

Support for the `file://` prefix

Most IO implementations allow the file:// prefix to indicate that the data is local. It would be great if DuckDB could support this as well 👍
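
For reference, the requested usage would look roughly like this (hypothetical path; per this report the scheme is not supported today):

import duckdb

duckdb.sql("INSTALL iceberg; LOAD iceberg;")
# Hypothetical local metadata file addressed with an explicit file:// scheme.
duckdb.sql("SELECT * FROM iceberg_scan('file:///data/warehouse/db/table/metadata/v1.metadata.json')")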

Unable to do a JOIN between two Iceberg-based tables

Example of what I'm trying to do:

select * from iceberg_scan('s3://<url1>.metadata.json') as t1 join iceberg_scan('s3://t2.metadata.json') as t2 on 1=1

When running the above I get this error:
Binder Error: Duplicate alias "iceberg_scan_data" in query!

Although I'm passing an alias for each iceberg_scan operation, it's still assigning the same iceberg_scan_data alias to both.

A suggestion would be to add the alias as a parameter to the function.

ORC Files

Iceberg tables can use ORC files for storage; this extension should flag that case with a clear error.

Write Support

Is there write support on the roadmap? If so, is there an ETA? I'll have to plan a future project and would like to keep using DuckDB.

Iceberg REST Catalog Support

Hey, team!

Very excited about the DuckDB v0.9 support for Iceberg!

I currently use a REST catalog for my Iceberg tables and was hoping to be able to wire up DuckDB to that, rather than pointing it at the actual underlying data/metadata files (a possible interim workaround is sketched below).

If this is available, I'd love to use it -- otherwise, I'd be happy to jump in and start coding if this feature is new.

Thanks!
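
Until native REST catalog support lands, one possible interim workaround (a sketch; the catalog URI and table identifier are hypothetical) is to resolve the current metadata location through PyIceberg and hand it to iceberg_scan:

import duckdb
from pyiceberg.catalog import load_catalog

# Hypothetical REST catalog endpoint and table identifier.
catalog = load_catalog("rest", **{"type": "rest", "uri": "http://localhost:8181"})
tbl = catalog.load_table("examples.nyc_taxis")

duckdb.sql("INSTALL iceberg; LOAD iceberg;")
duckdb.sql(f"SELECT count(*) FROM iceberg_scan('{tbl.metadata_location}')").show()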
