rgbkrk commented on July 30, 2024

Building a binder image

This is used at the API endpoint (e.g. api.mybinder.org) to work with binder image resources.

Creating a new binder

POST /binders/CodeNeuro/notebooks/ HTTP/1.1
Content-Type: application/json

{
  "name": "codeneuro-notebooks",
  "repo": "https://github.com/CodeNeuro/notebooks",
  "requirements": "repo/requirements.txt",
  "notebooks": "repo/notebooks",
  "services": [
    {
      "name": "spark",
      "version": "1.4.1",
      "params": {
        "heap_mem": "4g",
        "stack_mem": "512m"
      }
    }
  ]
}

That's copied straight from your current API, though it would be good to spec those out.

Detail on a binder

Right now the GET on /apps/<organization>/<repo>/ returns the redirect URI (yes, tmpnb does this too because of its limited purpose, but in that case the containers are already launched).

In my opinion this should tell you about the resource, returning the spec that was given in the POST as well as the status.

GET /binders/CodeNeuro/notebooks/ HTTP/1.1

would then return

{
  "name": "codeneuro-notebooks",
  "repo": "https://github.com/CodeNeuro/notebooks",
  "requirements": "repo/requirements.txt",
  "notebooks": "repo/notebooks",
  "services": [
    {
      "name": "spark",
      "version": "1.4.1",
      "params": {
        "heap_mem": "4g",
        "stack_mem": "512m"
      }
    }
  ]
}

That could include the status as well.
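For example (the field name and values here are just a sketch, not part of the current API), the body could carry a status alongside the spec:

{
  "name": "codeneuro-notebooks",
  "status": "completed"
}

where status might be something like "building", "completed", or "failed".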

Beyond that, I think a HEAD request makes sense here to check whether a binder exists.
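A minimal sketch of that check, assuming the same path as the GET above:

HEAD /binders/CodeNeuro/notebooks/ HTTP/1.1

returning 200 if the binder exists and 404 if it doesn't.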

rgbkrk commented on July 30, 2024

Launching a binder

Right now this is part of the GithubBuildHandler as a GET on the /apps/ resource. This piece could actually be decoupled from GitHub, relying only on image names (those that have been built, whitelisted, whatever). In the Docker API (just as a reference), they POST to /containers/create with the payload.

Since we're talking "precanned" images (in the sense that they were built beforehand or already exist) that get launched either by a user visiting a resource or via an AJAX call from JavaScript, I think we can pick a solid resource name. I was leaning towards spawn since it's a noun and we've already been using it in tmpnb, but picked containers after discussion on Gitter.

Since we're creating a container, we'd want to start this off as a POST with a GET retrieving that same information.

POST /containers/CodeNeuro-notebooks/ HTTP/1.1
Accept: application/json

Which would return

{
  "id": "12345"
}

If the resource were immediately available, the response could include the location. Otherwise, you'd retrieve the location for that specific container with a GET:

GET /containers/CodeNeuro-notebooks/12345

which returns

{
  "location": "...",
  "id": "12345"
}

You may ask yourself then, what if someone GETs the resource directly?

GET /containers/CodeNeuro-notebooks/ HTTP/1.1
Authorization: 5c011f6b474ed90761a0c1f8a47957a6f14549507f7929cc139cbf7d5b89

This should return all of the current containers that the user is allowed to see.

[
  {
    "location": "...",
    "id": "12345",
    "uri": "/containers/CodeNeuro-notebooks/12345"
  },
  {
    "location": "...",
    "id": "787234",
    "uri": "/containers/CodeNeuro-notebooks/787234"
  }
]

rgbkrk commented on July 30, 2024

The last thing I posted, about returning all the currently spawned containers, would be super helpful for operations (as I've found running tmpnb). It's probably worth thinking about authentication sooner rather than later, even if only for your own administration.

It's easy to defer to a separate authentication store, relying on an LRU key store (that's what I made https://github.com/rgbkrk/lru-key-store for, when deferring to a separate identity service) to keep yourself from making repeated calls to, e.g., GitHub or another identity provider.
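As a rough sketch (the status code and token handling are assumptions, nothing implemented), a request with a missing or unknown token would get rejected before we ever call out to the provider:

GET /containers/CodeNeuro-notebooks/ HTTP/1.1
Authorization: <unknown or expired token>

HTTP/1.1 401 Unauthorized

while a token already cached in the key store skips the round trip to GitHub entirely.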

rgbkrk commented on July 30, 2024

Working with pools of pre-allocated binders

Thinking about the pooling that tmpnb does (and that I would hope for in binder), I imagine we could have an endpoint at /pool/ to set up capacities (and inspect allocations) for images:

GET /pool/{imageName} HTTP/1.1

returns

{
  "running": 123,
  "available": 12,
  "minPool": 1
}

Updating the pool (by POST or PUT):

POST /pool/{imageName} HTTP/1.1
Authorization: 9f66083738d8e8fa48e2f19d4bd3bdb4821fa2d3fdc7d84e4228ded5e219

{
  "minPool": 512
}

rgbkrk commented on July 30, 2024

How would you feel if spawn and pool both take image names (binders?) and it's up to the underlying implementation whether they'll run that image or not?

andrewosh commented on July 30, 2024

That seems good to me. I'm imagining that would look something like:

  • Send the authorized PUT to /pool/ with the format you described above to create the pool
  • Check imageName against a whitelist of pool-able images (presumably only a small set of single-kernel images)
  • If imageName is either forbidden or does not exist, synchronously return some error code (see the sketch below)
  • Update the number of containers allocated to the pool with an authorized POST

Sound right?
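For that forbidden/nonexistent case, the error could look something like this (status codes and body shape are just a strawman, and the image name is hypothetical):

POST /pool/some-unlisted-image HTTP/1.1
Authorization: <token>

HTTP/1.1 403 Forbidden

{
  "error": "image is not whitelisted for pooling"
}

with a 404 instead if the image doesn't exist at all.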

Thinking along these lines, we should definitely come up with a plan for decoupling image names from GitHub repos. There's currently an empty class called OtherSourceHandler, which was intended to complement GithubBuildHandler by handling images from arbitrary sources (whose names can't be uniquely constructed from an organization/name combo). Perhaps that change should go in another issue...

rgbkrk commented on July 30, 2024

Yeah, sorry I brain-dumped it all here. This would work well on a wiki once we're specced out. I imagine I'll use the same API in a revision of tmpnb, which in my eyes would be a fairly restricted launcher that relies purely on the Docker API (including Swarm support out of the box).

For binder images that get created, I'd think of those as existing in a particular namespace - then you'd assume they're trusted.

Thinking about what I wrote though, maybe spawn and pool both take a binderName since it's coupled with services, etc.

freeman-lab commented on July 30, 2024

This is awesome @rgbkrk! I really like the API design and strategy. A couple of comments (after chatting with @andrewosh):

Great to split the GET and POST for the spawn; we wanted to decouple those anyway to make it easier to support a loading screen / querying for finished deployments.

It's not totally clear what the best way is to "spawn" an image that's already been "spawned" as part of a pool. Does it make sense that a POST to spawn either

(1) returns a location immediately if the image was part of a pool, which we can figure out via a GET on the pool,
or
(2) triggers an actual "spawn" and just returns an id, and then polling or websockets could be used to check whether it's been deployed and has a valid location?

Reasonable, or is there a better design?

Another question, re: naming, is whether we only support images linked to GitHub repos, in which case we can have a one-to-one mapping from repos <-> names. This will of course work for binders built from repos (by design), but for the "standard" images we want to support (e.g. for the lighter-weight kernels), do we need more flexibility? Or can we assume those are linked to repos too?

freeman-lab commented on July 30, 2024

and 👍 to putting this into a wiki once specced out, and then eventually into proper documentation =)

rgbkrk commented on July 30, 2024

(2) triggers an actual "spawn" and just returns an id, and then polling or websockets could be used to check whether it's been deployed and has a valid location.

I was curious about that too. I'll go ahead and update the above to reflect that, as I think the POST should return immediately. For now we can stick with long polling on the GET.
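A rough sketch of how that might play out with the container routes above (the status field is just a placeholder, nothing settled):

POST /containers/CodeNeuro-notebooks/ HTTP/1.1

{
  "id": "12345"
}

GET /containers/CodeNeuro-notebooks/12345 HTTP/1.1

{
  "id": "12345",
  "status": "pending"
}

and once it's actually up, the same GET returns the location:

{
  "id": "12345",
  "status": "running",
  "location": "..."
}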

rgbkrk commented on July 30, 2024

It's not totally clear what the best way is to "spawn" an image that's already been "spawned" as part of a pool.

Continuum once told me that I was cheating, since they're pre-spawned. I say it's an optimization we needed.

As you've said, they are already spawned when in the pool. Perhaps they're hatching at this point? Being allocated? Other nomenclature:

  • /hatchling/
  • /container/

These are the states we're wondering about:

  • Allocated, unused containers
  • Allocated, being used containers
  • Containers being culled
  • Containers being allocated

Perhaps this is a relationship between a container and a userContainer. That, or a user has a container resource.
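To make those states concrete, here's one strawman (field names entirely made up) for how a pooled container resource could expose them:

{
  "id": "12345",
  "state": "allocated",
  "user": null
}

where state might be one of "allocating", "allocated", "in-use", or "culling", and user gets filled in once the container is handed to someone.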

rgbkrk commented on July 30, 2024

What do you think about calling the resource we bundle/build an environment? To a user we might call them binders, but in other contexts they're a built-out resource for on-demand computation, coupled with:

  • specification of software to install
  • notebooks, data, and code to pull in (from a repo or otherwise - OK to stick with GH for the moment)
  • linked services
  • volumes, etc.
  • the application to serve/run

The last one is important since there are many uses for kernels beyond notebooks. As @parente put it on a hackpad we were mocking up for a Kernel Service API:

... launch this random.choice(['kernel', 'notebook', 'dashboard-app']) ...
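To make that last point concrete, a strawman environment spec (purely illustrative, extending the earlier binder spec with an app field) might look like:

{
  "name": "codeneuro-notebooks",
  "repo": "https://github.com/CodeNeuro/notebooks",
  "services": [
    {
      "name": "spark",
      "version": "1.4.1"
    }
  ],
  "app": "notebook"
}

where app could just as well be "kernel" or "dashboard-app".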

As long as we make an API spec we're happy with, we'll probably end up with at least these three cluster managers:

  • Kubernetes - Binder
  • Marathon / Mesos - mentioned as done by IBM
  • Docker / Swarm - tmpnb, dockerspawner (jupyterhub), ephemit

rgbkrk commented on July 30, 2024

Once you have kernels decoupled as the only thing being provisioned, you can serve notebooks from a separate content store. That content store could even be GitHub itself. 👍 Doing it with gists is pretty simple as well. This also opens up the use of Google Drive, Firebase, etc. as content stores that also allow for real-time collaboration.

However, we still often want to be able to access datasets in typical POSIX ways (CSVs, etc.) straight from repositories. Hence we either need to mount volumes, build the data into the container, or expose services for data.

I do have a prototype that cheats and uses the notebook contents API to inject data from a repo. That code is not being used, but I've tried it out locally for fun; it's an example that renders a notebook from a gist, populates thebe cells, and connects you to a tmpnb kernel.

freeman-lab commented on July 30, 2024

Re: naming -- which is the most important part ;) -- we're fully on board with aiming for a more general spec, to be used here and elsewhere. Here's our current favorite (with pithy summaries from above):

build
POST defines and creates a build, GET returns info on the build

container
POST triggers a launch (if unavailable) or returns a location (if available), GET returns a location (once available)

pool
POST or PUT sets up capacities, GET inspects allocations

build was formerly binder (the new name is more general), and container was formerly spawn (the new name better allows for the fact that these containers may have already been deployed, in a pool).
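Assuming the earlier paths carry over under the new names (the exact routes below are a guess, not settled), the surface might line up roughly as:

POST /builds/                    define and create a build
GET  /builds/{name}              info on a build (spec plus status)
POST /containers/{name}/         launch, or hand back a location if one is pooled
GET  /containers/{name}/{id}     location once available
POST /pool/{imageName}           set capacities (PUT also works)
GET  /pool/{imageName}           inspect allocations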

rgbkrk commented on July 30, 2024

build makes sense, though the attached services make it more like a deployment specification/template.
