sutoiku / puffin
Serverless HTAP cloud data platform powered by Arrow × DuckDB × Iceberg
Home Page: http://PuffinDB.io
License: MIT License
DuckDB supports the HUGEINT type for signed 16-byte integers, but Apache Iceberg does not. How should values of this type be cast? A pragmatic option would be to cast them to long (64-bit signed) integers, but this could result in significant loss of information. Assuming this is acceptable, what should be done with values that are out of bounds? Otherwise, which alternative options should be considered?
Link: Types
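To make the out-of-bounds question concrete, here is a minimal sketch (not PuffinDB code; the function name and the clamp-vs-raise policy are assumptions) of a HUGEINT-to-long cast that either saturates at the int64 limits or raises, leaving the policy choice to the caller:

```python
# Sketch: map a DuckDB HUGEINT (128-bit) value into Iceberg's
# long (64-bit signed) type. Python ints are arbitrary-precision,
# so they can stand in for HUGEINT here.

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1

def hugeint_to_long(value: int, strict: bool = False) -> int:
    """Cast a 128-bit integer to 64 bits.

    strict=False clamps out-of-range values to the nearest int64 bound;
    strict=True raises instead, forcing the caller to decide.
    """
    if INT64_MIN <= value <= INT64_MAX:
        return value
    if strict:
        raise OverflowError(f"{value} does not fit in an Iceberg long")
    return INT64_MAX if value > INT64_MAX else INT64_MIN
```

Saturation is lossy but deterministic; raising pushes the decision up to the query layer, which may be preferable for a data platform where silent corruption is worse than a failed write.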
Every query must be logged into an Iceberg table using an INSERT INTO query. Batching multiple such queries into one would make it more efficient, but would require some queuing mechanism. Since low latency is not an absolute requirement for query logs, Amazon SQS could be used for such a purpose, but should other options be considered?
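As a point of comparison for a queue-based design, batching can also be done in-process. A minimal sketch (hypothetical helper, not part of PuffinDB; the table and column names are assumptions) that accumulates log rows and flushes them as a single multi-row INSERT INTO:

```python
# Sketch: batch query-log rows in memory and flush them as one
# multi-row INSERT, trading a little latency for far fewer commits
# against the Iceberg log table.

from typing import Callable

class QueryLogBatcher:
    def __init__(self, execute: Callable[[str], None], max_rows: int = 100):
        self.execute = execute      # e.g. a function that runs SQL via DuckDB
        self.max_rows = max_rows
        self.pending: list = []

    def log(self, query_text: str, duration_ms: float) -> None:
        self.pending.append((query_text, duration_ms))
        if len(self.pending) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        values = ", ".join(
            # naive single-quote escaping, sufficient for the sketch
            "('{}', {})".format(q.replace("'", "''"), d)
            for q, d in self.pending
        )
        self.execute(
            f"INSERT INTO query_log (query_text, duration_ms) VALUES {values}"
        )
        self.pending.clear()
```

An SQS-based design moves this buffer out of the query path entirely, which matters if Lambda instances are short-lived and an in-memory buffer could be lost on shutdown.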
In EDDI.md you propose a SELECT THROUGH syntax (I think this was previously SELECT REMOTE) like
SELECT THROUGH 'https://myPuffinDB.com/' * FROM remoteTable;
I would suggest that you rather make THROUGH a separate clause: while the current proposal might read more naturally, IMHO it doesn't make much sense semantically, since it sits inside the SELECT clause. The SELECT clause should be about specifying the projections of the relation, and it's not clear to me that THROUGH relates to that.
The following still reads like sensible English while having more semantic separation:
THROUGH 'https://myPuffinDB.com/' SELECT * FROM remoteTable;
Moving forward, DuckDB will be found everywhere, both client-side and cloud-side. When used client-side, what would be the best way to integrate it with PuffinDB running cloud-side? As a developer, I would like to make a query from my DuckDB client, have it executed cloud-side by PuffinDB, have its result streamed with Apache Arrow to my client, and have that result saved as a local Apache Parquet file, or loaded into my local DuckDB client with a CREATE TABLE. With that in mind, what needs to be changed in DuckDB to make that dataflow as seamless as possible? And could this dataflow be further improved upon?
PuffinDB will include a distributed query planner. This complex component should probably be developed on top of an existing framework implementing a relational algebra. The two main options are Apache Calcite and Substrait. Which one should we use? Either should work, but we must pick one. This decision will have major and lasting impacts on the codebase.
Given Athena's ability to execute SQL queries in a distributed and serverless fashion against a data lakehouse
By default, query results will be cached on the Object Store (S3 on AWS). But query results that are requested often could be cached using Amazon CloudFront.
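For either cache tier to work, identical queries must map to the same object key. A minimal sketch (assumed design, not PuffinDB code; the key scheme and function name are illustrative) of a deterministic cache key that also invalidates when the underlying table changes:

```python
# Sketch: derive a deterministic object-store key for a query result,
# so identical queries hit the same cached object on S3 (and, behind
# CloudFront, the same edge-cached object).

import hashlib

def result_cache_key(sql: str, snapshot_id: int, prefix: str = "query-cache") -> str:
    """Key on the whitespace-normalized, lowercased query text plus the
    Iceberg snapshot it ran against, so the cache is naturally
    invalidated whenever the table advances to a new snapshot."""
    normalized = " ".join(sql.split()).lower()
    digest = hashlib.sha256(f"{snapshot_id}:{normalized}".encode()).hexdigest()
    return f"{prefix}/{digest}.parquet"
```

Including the snapshot ID in the key sidesteps explicit invalidation: stale results are simply never requested again, and a TTL or lifecycle rule can garbage-collect them.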
Hello!
I've been thinking a little about authentication and authorisation.
A few assumptions:
My rough proposal is that:
@ghalimi have you thought about auth at all? I am happy to flesh this out a little if the above is agreeable. I think the most important point is leaning into the cloud that puffin is hosted on (point 2 above).
Can the Substrait cross-language serialization for relational algebra help in any way?
Apache Iceberg supports the fixed(L) type for a fixed-length byte array of length L. Should a similar type be added to DuckDB? Without it, values of this type are cast to the VARCHAR type, whose encoding might lead to performance degradation and/or memory waste.
Link: Types
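To illustrate what a fixed(L) coercion could look like if built on top of a variable-length type, here is a minimal sketch (illustrative only; the pad-with-zeros and refuse-to-truncate policies are assumptions, not Iceberg or DuckDB behavior):

```python
# Sketch: coerce a variable-length value to Iceberg fixed(L) semantics.
# Short values are zero-padded to exactly L bytes; longer values are
# rejected rather than silently truncated.

def to_fixed(value: bytes, length: int) -> bytes:
    """Return value padded to exactly `length` bytes, or raise if it
    cannot fit a fixed(length) column without losing data."""
    if len(value) > length:
        raise ValueError(
            f"value of {len(value)} bytes does not fit fixed({length})"
        )
    return value.ljust(length, b"\x00")
```

A native fixed-length type would avoid both the per-value length header and the padding bookkeeping, which is where the performance and memory concerns with a VARCHAR fallback come from.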