sutoiku / puffin Goto Github PK

View Code? Open in Web Editor NEW

283.0 26.0 12.0 18.25 MB

Serverless HTAP cloud data platform powered by Arrow × DuckDB × Iceberg

Home Page: http://PuffinDB.io

License: MIT License

arrow duckdb iceberg serverless

puffin's People

Contributors

Stargazers

Watchers

Forkers

quanpinjie rupurt mxmzdlv dorsyd phe voberoi yusufozturk zephyrsz sksundaram-learning ayuryshev chongzhang-vis

puffin's Issues

How should values of DuckDB's HUGEINT be cast to an Iceberg type?

DuckDB supports the HUGEINT type for signed sixteen-byte integers, but Apache Iceberg does not. How should values of this type be cast? A pragmatic option would be to use long 64-bit signed integers, but this would result in significant loss of information. Assuming this is acceptable, what should be done with values that are out of bounds? Otherwise, which alternative options should be considered?

Link: Types

How query logs should be queued for batch INSERT INTO?

Every query must be logged into an Iceberg table using an INSERT INTO query. Batching multiple such queries into one would make it more efficient, but would require some queuing mechanism. Since low latency is not an absolute requirement for query logs, Amazon SQS could be used for such a purpose, but should other options be considered?

SELECT THROUGH

In EDDI.md you propose a SELECT THROUGH syntax (I think this was previously SELECT REMOTE) like

SELECT THROUGH 'https://myPuffinDB.com/' * FROM remoteTable;

I would suggest that you rather make THROUGH a separate clause since while the current proposal might read more naturally, IMHO it doesn't make so much sense semantically since it sits inside the SELECT clause. The SELECT clause should be about specifying the projections of the relation and it's not clear to me that THROUGH relates to that.

The following still reads like sensible English while having more semantic separation:

THROUGH 'https://myPuffinDB.com/' SELECT * FROM remoteTable;

What would be the best way to integrate a client-side DuckDB engine with PuffinDB?

Moving forward, DuckDB will be found everywhere, both client-side and cloud-side. When used client-side, what would be the best way to integrate it with PuffinDB running cloud-side? As a developer, I would like to make a query from my DuckDB client, have it executed cloud-side by PuffinDB, have its result streamed with Apache Arrow to my client, and have that result saved as a local Apache Parquet file, or loaded into my local DuckDB client with a CREATE TABLE. With that in mind, what needs to be changed in DuckDB to make that dataflow as seamless as possible?

And could this dataflow be further improved upon?

Should we use Apache Calcite or Substrait?

PuffinDB will include a distributed query planner. This complex component should probably be developed on top of an existing framework implementing a relational algebra. The two main options are Apache Calcite and Substrait. Which one should we use? Either should work, but we must pick one. This decision will have major and lasting impacts on the codebase.

How does the read performance compare to AWS Athena?

Given Athena's ability to execute SQL queries in a distributed and serverless fashion against a data lakehouse

Should we consider adding support for CloudFront to cache certain query results?

By default, query results will be cached on the Object Store (S3 on AWS). But query results that are requested often could be cached using Amazon CloudFront.

This idea was originally suggested by @TobiM.

Authentication and Authorization

Hello!

I've been thinking a little about authentication and authorisation.

A few assumptions:

DuckDB does not implement roles. I am 90% sure of this, based on a quick scan of documentation + my working experience.
Such an approach (possibly inherited / controlled by IAM / similar concepts in GCS & Azure) is the way to go
Extensions in DuckDB are helpful

My rough proposal is that:

Puffin builds a (potentially lightweight) role system (obviously there is some work here).
This is configured for each user of puffin, and this leverages / uses cloud services to set this up (eg. parameter store in AWS)
This is configured at run time for puffin users via configuration variables in DuckDB. This is a pattern that works well for s3, but as the allowed configurations are limited in DuckDB. Thankfully, extensions allow for these configuration variables to be added to

@ghalimi have you thought about auth at all? I am happy to flesh this out a little if the above is agreeable. I think the most important point is leaning into the cloud that puffin is hosted on (point 2 above).

Can Substrait help in any way?

Can the Substrait cross-language serialization for relational algebra help in any way?

Should a FIXED type for fixed-length character strings be added to DuckDB?

Apache Iceberg supports the fixed(L) type for a fixed-length character string of length L. Should a similar type be added to DuckDB? Without it, values of this type are cast to the VARCHAR type, which encoding might lead to performance degradation and | or memory waste.

Link: Types

sutoiku / puffin Goto Github PK

puffin's People

Contributors

Stargazers

Watchers

Forkers

puffin's Issues

How should values of DuckDB's HUGEINT be cast to an Iceberg type?

How query logs should be queued for batch INSERT INTO?

SELECT THROUGH

What would be the best way to integrate a client-side DuckDB engine with PuffinDB?

Should we use Apache Calcite or Substrait?

How does the read performance compare to AWS Athena?

Should we consider adding support for CloudFront to cache certain query results?

Authentication and Authorization

Can Substrait help in any way?

Should a FIXED type for fixed-length character strings be added to DuckDB?

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent