rgbkrk commented on July 30, 2024

Building a binder image

This is used at the API endpoint (e.g. api.mybinder.org) to work with binder image resources.

Creating a new binder

POST /binders/CodeNeuro/notebooks/ HTTP/1.1
Content-Type: application/json

{
  "name": "codeneuro-notebooks",
  "repo": "https://github.com/CodeNeuro/notebooks",
  "requirements": "repo/requirements.txt",
  "notebooks": "repo/notebooks",
  "services": [
    {
      "name": "spark",
      "version": "1.4.1",
      "params": {
        "heap_mem": "4g",
        "stack_mem": "512m"
      }
    }
  ]
}

That's copied straight from your current API, though it would be good to spec those out.

Detail on a binder

Right now the GET on /apps/<organization>/<repo>/ returns the redirect URI (yes, tmpnb does this too because of its limited purpose, but in that case the containers are already launched).

In my opinion this should tell you about the resource, returning the spec that was given in the POST as well as the status.

GET /binders/CodeNeuro/notebooks/ HTTP/1.1

would then return

{
  "name": "codeneuro-notebooks",
  "repo": "https://github.com/CodeNeuro/notebooks",
  "requirements": "repo/requirements.txt",
  "notebooks": "repo/notebooks",
  "services": [
    {
      "name": "spark",
      "version": "1.4.1",
      "params": {
        "heap_mem": "4g",
        "stack_mem": "512m"
      }
    }
  ]
}

That could include the status as well.
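For example (the field name and values here are just a sketch, not part of the current API), the body could carry a status alongside the spec:

{
  "name": "codeneuro-notebooks",
  "status": "completed"
}

where status might be something like "building", "completed", or "failed".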

Beyond that, I think a HEAD request makes sense here to check whether a binder exists.
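A minimal sketch of that check, assuming the same path as the GET above:

HEAD /binders/CodeNeuro/notebooks/ HTTP/1.1

returning 200 if the binder exists and 404 if it doesn't.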

rgbkrk commented on July 30, 2024

Launching a binder

Right now this is part of the GithubBuildHandler as a GET on the /apps/ resource. This piece could actually be decoupled from GitHub, relying only on image names (those that have been built, whitelisted, whatever). In the Docker API (just as a reference), they POST to /containers/create with the payload.

Since we're talking "precanned" images (in the sense that they were built beforehand or already exist) that get launched either by a user visiting a resource or via an AJAX call from JavaScript, I think we can pick a solid resource name. I was leaning towards spawn since it's a noun and we've already been using it in tmpnb, but picked containers after discussion on Gitter.

Since we're creating a container, we'd want to start this off as a POST with a GET retrieving that same information.

POST /containers/CodeNeuro-notebooks/ HTTP/1.1
Accept: application/json

Which would return

{
  "id": "12345"
}

If the resource were immediately available, the response could include the location. Otherwise, you'd retrieve the location for that specific container with a GET:

GET /containers/CodeNeuro-notebooks/12345

which returns

{
  "location": "...",
  "id": "12345"
}

You may ask yourself then, what if someone GETs the resource directly?

GET /containers/CodeNeuro-notebooks/ HTTP/1.1
Authorization: 5c011f6b474ed90761a0c1f8a47957a6f14549507f7929cc139cbf7d5b89

This should return all of the current containers that the user is allowed to see.

[
  {
    "location": "...",
    "id": "12345",
    "uri": "/containers/CodeNeuro-notebooks/12345"
  },
  {
    "location": "...",
    "id": "787234",
    "uri": "/containers/CodeNeuro-notebooks/787234"
  }
]

rgbkrk commented on July 30, 2024

The last thing I posted, about returning all the currently spawned containers, would be super helpful for operations (as I've found running tmpnb). It's probably worth thinking about authentication sooner rather than later, even if only for your own administration.

It's easy to defer to a separate authentication store, relying on an LRU key store (that's what I made https://github.com/rgbkrk/lru-key-store for, when deferring to a separate identity service) to keep yourself from making repeated calls to, e.g., GitHub or another identity provider.
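As a rough sketch (the status code and token handling are assumptions, nothing implemented), a request with a missing or unknown token would get rejected before we ever call out to the provider:

GET /containers/CodeNeuro-notebooks/ HTTP/1.1
Authorization: <unknown or expired token>

HTTP/1.1 401 Unauthorized

while a token already cached in the key store skips the round trip to GitHub entirely.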

rgbkrk commented on July 30, 2024

Working with pools of pre-allocated binders

Thinking about the pooling that tmpnb does (and that I would hope for in binder), I imagine we could have an endpoint at /pool/ to set up capacities (and inspect allocations) for images:

GET /pool/{imageName} HTTP/1.1

returns

{
  "running": 123,
  "available": 12,
  "minPool": 1
}

Updating the pool (by POST or PUT):

POST /pool/{imageName} HTTP/1.1
Authorization: 9f66083738d8e8fa48e2f19d4bd3bdb4821fa2d3fdc7d84e4228ded5e219

{
  "minPool": 512
}

rgbkrk commented on July 30, 2024

How would you feel if spawn and pool both take image names (binders?) and it's up to the underlying implementation whether they'll run that image or not?

andrewosh commented on July 30, 2024

That seems good to me. I'm imagining that would look something like:

  • Send the authorized PUT to /pool/ with the format you described above to create the pool
  • Check imageName against a whitelist of pool-able images (presumably only a small set of single-kernel images)
  • If imageName is either forbidden or does not exist, synchronously return some error code (see the sketch below)
  • Update the number of containers allocated to the pool with an authorized POST

Sound right?
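For that forbidden/nonexistent case, the error could look something like this (status codes and body shape are just a strawman, and the image name is hypothetical):

POST /pool/some-unlisted-image HTTP/1.1
Authorization: <token>

HTTP/1.1 403 Forbidden

{
  "error": "image is not whitelisted for pooling"
}

with a 404 instead if the image doesn't exist at all.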

Thinking along these lines, we should definitely come up with a plan for decoupling image names from GitHub repos. There's currently an empty class called OtherSourceHandler, which was intended to complement GithubBuildHandler by handling images from arbitrary sources (whose names can't be uniquely constructed from an organization/name combo). Perhaps that change should go in another issue...

rgbkrk commented on July 30, 2024

Yeah, sorry I brain-dumped it all here. This would work well on a wiki once we're specced out. I imagine I'll use the same API in a revision of tmpnb, which in my eyes would be a fairly restricted launcher that relies purely on the Docker API (including Swarm support out of the box).

For binder images that get created, I'd think of those as existing in a particular namespace - then you'd assume they're trusted.

Thinking about what I wrote though, maybe spawn and pool both take a binderName since it's coupled with services, etc.

freeman-lab commented on July 30, 2024

This is awesome @rgbkrk! I really like the API design and strategy. A couple of comments (after chatting with @andrewosh):

Great to split the GET and POST for the spawn; we wanted to decouple those anyway to make it easier to support a loading screen / querying for finished deployments.

It's not totally clear what the best way is to "spawn" an image that's already been "spawned" as part of a pool. Does it make sense that a POST to spawn either

(1) returns a location immediately if the image was part of a pool, which we can figure out via a GET on the pool,
or
(2) triggers an actual "spawn" and just returns an id, and then polling or websockets could be used to check whether it's been deployed and has a valid location?

Reasonable, or is there a better design?

Another question, re: naming, is whether we only support images linked to GitHub repos, in which case we can have a one-to-one mapping from repos <-> names. This will of course work for binders built from repos (by design), but for the "standard" images we want to support (e.g. for the lighter-weight kernels), do we need more flexibility? Or can we assume those are linked to repos too?

freeman-lab commented on July 30, 2024

and 👍 to putting this into a wiki once specced out, and then eventually into proper documentation =)

rgbkrk commented on July 30, 2024

(2) triggers an actual "spawn" and just returns an id, and then polling or websockets could be used to check whether it's been deployed and has a valid location.

I was curious about that too. I'll go ahead and update the above to reflect that, as I think the POST should return immediately. For now we can stick with long polling on the GET.
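A rough sketch of how that might play out with the container routes above (the status field is just a placeholder, nothing settled):

POST /containers/CodeNeuro-notebooks/ HTTP/1.1

{
  "id": "12345"
}

GET /containers/CodeNeuro-notebooks/12345 HTTP/1.1

{
  "id": "12345",
  "status": "pending"
}

and once it's actually up, the same GET returns the location:

{
  "id": "12345",
  "status": "running",
  "location": "..."
}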

rgbkrk commented on July 30, 2024

It's not totally clear what the best way is to "spawn" an image that's already been "spawned" as part of a pool.

Continuum once told me that I was cheating, since they're pre-spawned. I say it's an optimization we needed.

As you've said, they are already spawned when in the pool. Perhaps they're hatching at this point? Being allocated? Other nomenclature:

  • /hatchling/
  • /container/

These are the states we're wondering about:

  • Allocated, unused containers
  • Allocated, being used containers
  • Containers being culled
  • Containers being allocated

Perhaps this is a relationship between a container and a userContainer. That, or a user has a container resource.
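To make those states concrete, here's one strawman (field names entirely made up) for how a pooled container resource could expose them:

{
  "id": "12345",
  "state": "allocated",
  "user": null
}

where state might be one of "allocating", "allocated", "in-use", or "culling", and user gets filled in once the container is handed to someone.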

rgbkrk commented on July 30, 2024

What do you think about calling the resource we bundle/build an environment? To a user we might call them binders, but in other contexts they're a built-out resource for on-demand computation, coupled with:

  • specification of software to install
  • notebooks, data, and code to pull in (from a repo or otherwise - OK to stick with GH for the moment)
  • linked services
  • volumes, etc.
  • the application to serve/run

The last one is important since there are many uses for kernels beyond notebooks. As @parente put it on a hackpad we were mocking up for a Kernel Service API:

... launch this random.choice(['kernel', 'notebook', 'dashboard-app']) ...
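To make that last point concrete, a strawman environment spec (purely illustrative, extending the earlier binder spec with an app field) might look like:

{
  "name": "codeneuro-notebooks",
  "repo": "https://github.com/CodeNeuro/notebooks",
  "services": [
    {
      "name": "spark",
      "version": "1.4.1"
    }
  ],
  "app": "notebook"
}

where app could just as well be "kernel" or "dashboard-app".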

As long as we make an API spec we're happy with, we'll probably end up with at least these three cluster managers:

  • Kubernetes - Binder
  • Marathon / Mesos - mentioned as done by IBM
  • Docker / Swarm - tmpnb, dockerspawner (jupyterhub), ephemit

rgbkrk commented on July 30, 2024

Once you have kernels decoupled as the only thing being provisioned, you can serve notebooks from a separate content store. That content store could even be GitHub itself. 👍 Doing it with gists is pretty simple as well. This also opens up the use of Google Drive, Firebase, etc. as content stores that also allow for real-time collaboration.

However, we still often want to be able to access datasets in typical POSIX ways (CSVs, etc.) straight from repositories. Hence we either need to mount volumes, build the data into the container, or expose services for data.

I do have a prototype that cheats and uses the notebook contents API to inject data from a repo. That code is not being used, but I've tried it out locally for fun; it's an example that renders a notebook from a gist, populates thebe cells, and connects you to a tmpnb kernel.

freeman-lab commented on July 30, 2024

Re: naming -- which is the most important part ;) -- we're fully on board with aiming for a more general spec, to be used here and elsewhere. Here's our current favorite (with pithy summaries from above):

build
POST defines and creates a build, GET returns info on the build

container
POST triggers a launch (if unavailable) or returns a location (if available), GET returns a location (once available)

pool
POST or PUT sets up capacities, GET inspects allocations

build was formerly binder (the new name is more general), and container was formerly spawn (the new name better allows for the fact that these containers may have already been deployed, in a pool).
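Assuming the earlier paths carry over under the new names (the exact routes below are a guess, not settled), the surface might line up roughly as:

POST /builds/                    define and create a build
GET  /builds/{name}              info on a build (spec plus status)
POST /containers/{name}/         launch, or hand back a location if one is pooled
GET  /containers/{name}/{id}     location once available
POST /pool/{imageName}           set capacities (PUT also works)
GET  /pool/{imageName}           inspect allocations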

rgbkrk commented on July 30, 2024

build makes sense, though the attached services make it more like a deployment specification/template.
