
Comments (10)

rupurt commented on June 15, 2024

@jpmmcneill great issue to raise! I feel like this is a super important feature for many companies. Many places won't even consider a tool unless it fits within their data governance framework.

Some good examples to take inspiration from are:

A metastore will become a performance bottleneck as all queries would be routed through it. It would be useful to create & track baseline performance numbers on all competing metastore implementations.

from puffin.

ghalimi commented on June 15, 2024

Indeed. In fact, I would go even further. It's not so much about what PuffinDB wants to be, but rather what PuffinDB should be. What is legitimate? What is realistic? And how do people want to use PuffinDB, in relation to everything else.

Most developers tend to view their piece of software as being at the center, with everything else revolving around it. I don't see PuffinDB that way. I see it as a layer within one tall stack. The layer below is the Data Lake (Iceberg, Delta Lake, Hudi), and I tend to think that it will play a critical role with respect to authorization. But if it does not, I am pretty sure that another layer will do that, and this layer is unlikely to be the one occupied by PuffinDB.

The reason for the latter is this: authorization is needed by many different things, and PuffinDB is only one among many. Therefore, it can't be PuffinDB's responsibility to offer it.

At an abstract level, I view the problem of authorization as one for which a relation must be established between three entities:

  • an actor (e.g. a user, a role)
  • an action (e.g. UPDATE)
  • a resource (e.g. a row in a table)

The user directory (or something related to it) manages the set of actors. The Data Lake defines the set of actions and the resources they can be performed against. But we should keep in mind that other systems will define other sets of actions and resources for which the same can be done. Therefore, the Data Lake is not really special in that respect.

In other words, the authorization system defines these triplets (or variations on the same theme) and makes sure that they are enforced. This enforcement is (or needs to be) performed in large part by the Data Lake. Therefore, the only missing piece is the metadata itself. Where should it go? I'm not sure yet. Could it be done by PuffinDB temporarily if we can't find any better alternative? Maybe. But if we do that, we should make it clear that it is not ideal, and we should package this functionality in a very discrete manner, so that it can be externalized easily.
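A minimal sketch of the actor/action/resource triplet described above, assuming the stopgap case where the metadata is held in memory behind a small interface so that it can later be moved to the data lake or another layer. The names `Grant`, `Authorizer`, and `InMemoryAuthorizer` are hypothetical and do not come from PuffinDB.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Grant:
    """One authorization triplet: who may do what to which resource."""
    actor: str      # e.g. a user or a role name
    action: str     # e.g. "UPDATE"
    resource: str   # e.g. a table or row identifier


class Authorizer(Protocol):
    """Hypothetical interface; a data lake or external service could implement it."""

    def is_allowed(self, actor: str, action: str, resource: str) -> bool: ...


class InMemoryAuthorizer:
    """Stopgap grant store, kept behind the interface so it can be externalized."""

    def __init__(self, grants: Iterable[Grant]) -> None:
        self._grants = set(grants)

    def is_allowed(self, actor: str, action: str, resource: str) -> bool:
        # Deny by default: only explicitly stored triplets are permitted.
        return Grant(actor, action, resource) in self._grants


authz: Authorizer = InMemoryAuthorizer([Grant("analyst", "SELECT", "sales.orders")])
```

Callers depend only on `is_allowed`, so replacing the in-memory store with a lake-backed implementation would not change their code.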


jpmmcneill commented on June 15, 2024

I am unsure if datalakes come with roles attached. If they do, then I guess there is less work for puffin to implement!


ghalimi commented on June 15, 2024

@jpmmcneill Thanks a lot for taking the lead on this.

Two very different things here, but they should be addressed with the same philosophy: leverage existing cloud services as much as possible. We should support all possible authentication models, and we should not develop our own authorization model. For authentication, supporting all possible mechanisms will be painful, but we don't have to be super creative. For authorization, getting the right abstraction level is critical.

My understanding is that the data lake will be where the action takes place. Therefore, the DuckDB client used as PuffinDB client should not have to concern itself with this. In other words, users will be authenticated, and the authentication layer will bring whatever role/group/credential along. From there, these will be passed to the lake, which will be responsible for performing the right authorization.
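The flow described above could be sketched roughly as follows: the authentication layer yields an identity with its roles, and the query path forwards that identity to the lake instead of deciding anything itself. Everything here (`Identity`, `LakeCheck`, `run_query`, and the stand-in policy) is a hypothetical illustration, not an existing PuffinDB or DuckDB API.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Identity:
    """What the authentication layer hands over (hypothetical shape)."""
    user: str
    roles: FrozenSet[str]


# Hypothetical signature of the lake's decision function:
# (identity, action, table) -> allowed?
LakeCheck = Callable[[Identity, str, str], bool]


def run_query(identity: Identity, action: str, table: str, lake_check: LakeCheck) -> str:
    """Forward the caller's identity to the lake and let the lake decide."""
    if not lake_check(identity, action, table):
        raise PermissionError(f"{identity.user} may not {action} {table}")
    return f"{action} on {table} executed for {identity.user}"


def demo_lake_check(identity: Identity, action: str, table: str) -> bool:
    # Stand-in policy for illustration only:
    # writers may do anything, everyone else may only read.
    return "writer" in identity.roles or action == "SELECT"
```

The client-facing function holds no policy of its own; swapping `demo_lake_check` for a call into the lake's real authorization mechanism is the only change needed.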

The only piece that we might have to handle, under the lake's supervision, is the role-based filtering of data. But I would want the table formats to mature a bit more before getting too deep in that area.


jpmmcneill commented on June 15, 2024

Thank you both. I agree with @ghalimi that this is definitely two different things!

In summary:

  1. A yes/no auth is sufficient for some MVP version. This should be done purely via the DuckDB client. How exactly this will happen is probably still a to-do?
  2. In future, Puffin might come with a metastore as @rupurt mentioned. This would help with governance (completely agreed that this is very important).

A rough illustration...

flowchart TD
    ClientWithoutAuth[Client Without Auth] --> AuthWithIAM
    AuthWithIAM[Auth via IAM] --> ClientWithAuth
    ClientWithAuth[Client With Auth] --> InitialRelease{Initial Release}
    InitialRelease ---> NoMetaStore(Once authenticated, the Client can do anything\ni.e. drop tables, create new ones etc.)
    ClientWithAuth -.-> FutureVersion{Future Version}
    FutureVersion ---> MetaStore(Supported Features... NB. TBC)
    MetaStore -.-> Roles[Roles, Inherited from IAM]
    MetaStore -.-> Privileges


ghalimi commented on June 15, 2024

I don't think that PuffinDB needs its own metastore. Ideally, everything related to authorization should be handled by the data lake, don't you think?


jpmmcneill commented on June 15, 2024

> I don't think that PuffinDB needs its own metastore. Ideally, everything related to authorization should be handled by the data lake, don't you think?

I suppose that is reasonable, but afaik some data lakes don't do this well. For example, I couldn't find anything on metastores, roles, etc. for Delta Lake when I looked.

So I think the question is really: how many features does puffin want to take on in this area? In my situation, I'd see puffin replacing a CDWH, and as such it'd be very, very nice for puffin to have those features out of the box.

However, if puffin only wants to act as a compute engine for datalakes (and not provide some sugar around the edges too), then not building out these features would make sense.


jpmmcneill commented on June 15, 2024

Great - thank you @ghalimi. That makes sense. The real advantage for the user of seeing it as a layer within one tall stack is that layers can be switched around relatively easily, independently of other layers (up to how they interface).

The scope of the issue probably doesn't make sense anymore, as I wasn't fully aware of your thinking on this area - but I am now! Happy to close this off if you are 😄


tobilg commented on June 15, 2024

Honestly, I would rather try not to over-engineer things in this space, as building a bullet-proof authentication/authorization system is inherently hard.

The public cloud providers (AWS and Azure) already have their IAM systems in place, and PuffinDB would be able to use these via the credential chains / roles passed to the services where PuffinDB will run (VMs, containers, functions), so my recommendation is to use these.
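As a rough illustration of what a credential chain does, the sketch below tries a list of providers in order and returns the first credentials found. It is a simplified, standard-library-only stand-in, not the AWS SDK's or DuckDB's actual implementation. The environment variable names are the standard AWS ones; `from_environment`, `resolve`, and the dictionary keys are hypothetical.

```python
import os
from typing import Callable, Dict, Optional, Sequence

Credentials = Dict[str, str]
Provider = Callable[[], Optional[Credentials]]


def from_environment() -> Optional[Credentials]:
    """Read the standard AWS environment variables, if both required ones are set."""
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if not key or not secret:
        return None
    creds = {"access_key_id": key, "secret_access_key": secret}
    token = os.environ.get("AWS_SESSION_TOKEN")
    if token:
        # Temporary (role-based) credentials also carry a session token.
        creds["session_token"] = token
    return creds


def resolve(providers: Sequence[Provider]) -> Credentials:
    """Return credentials from the first provider in the chain that yields any."""
    for provider in providers:
        creds = provider()
        if creds is not None:
            return creds
    raise LookupError("no credential provider in the chain returned credentials")
```

A real chain would add further providers after `from_environment` (for example, a shared config file or the instance/container role endpoint); the ordering of the list is what defines precedence.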

The functionality DuckDB currently has is sufficient when the data is accessed via object stores, such as S3. What it lacks is the ability to pass credentials for HTTP requests, so if you want to be able to query data via HTTP, DuckDB needs to add this functionality first. I created a feature request regarding this a while ago: duckdb/duckdb#5972. So far, I haven't heard back on whether or when they will implement this.


ghalimi commented on June 15, 2024

I tend to agree. This still leaves the authorization question open, but I am confident that we'll find an elegant way to deal with it. The less we do on that front, the better. Not that we should ignore the problem, but its solution probably relies on using the right building blocks offered by third parties, rather than developing our own thing.

