
renku-notebooks's Issues

provide appropriate image ref/tag to use for a commit

The notebook service should provide an endpoint that determines which image should be used for a particular commit -- this assumes that we only build images on changes to the files that define the environment. For example, we could use the GitLab CI functionality that limits job triggers to changes in specific paths:

image_build:
  stage: build
  image: docker:stable
  before_script:
    - docker login -u gitlab-ci-token -p $CI_JOB_TOKEN $CI_REGISTRY
  script:
    - CI_COMMIT_SHA_7=$(echo $CI_COMMIT_SHA | cut -c1-7)
    - docker pull renku/singleuser:latest
    - docker build --tag $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7 .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7
  only: 
    changes:
      - Dockerfile
      - requirements.txt 
      - envs/*
  tags:
    - image-build

related to #98

use dns-safe server names

Still failing on usernames with non-alphanumeric characters:

{"reason":"FieldValueInvalid","message":"Invalid value: \"renku-jupyter-rok-2erosk-rok_2erosk-proj-85a5229\": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')"

The username was rok.roskar. This issue is continued from SwissDataScienceCenter/renku#252.
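A sketch of one way to make such usernames DNS-safe: lowercase, substitute disallowed characters, and disambiguate with a short hash so distinct usernames cannot collapse to the same label. The `dns_safe` helper and hashing scheme here are illustrative, not the escaping Renku actually uses:

```python
import hashlib
import re


def dns_safe(name, max_length=63):
    """Turn an arbitrary username into a DNS-1123-safe label.

    Lowercases the input, replaces every disallowed character with '-',
    trims leading/trailing '-', and appends a short hash so that two
    usernames that collapse to the same label stay distinct.
    """
    safe = re.sub(r'[^a-z0-9-]', '-', name.lower()).strip('-')
    suffix = hashlib.md5(name.encode('utf-8')).hexdigest()[:5]
    return '{0}-{1}'.format(safe[:max_length - 6], suffix)


dns_safe('rok.roskar')  # 'rok-roskar-' plus a 5-character hash
```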

should wait for server to be ready before redirecting

If the notebook server takes a long time to spawn (e.g. because images have to be downloaded), JH waits for 30 seconds and then redirects to a URL like .../hub/user/..., which fails with a 401: it is not actually a valid URL, but it still gets proxied to the notebooks service for some reason, which then queries gitlab for a non-existent project, hence the 401.

The way to fix this is to wait until the server is ready before redirecting. It's unclear how to get this information.

moved from SwissDataScienceCenter/renku#210 -- see SwissDataScienceCenter/renku#210 (comment)
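One possible approach: poll the JupyterHub REST API, which (in JupyterHub >= 0.9) reports a `ready` flag per server, and only redirect once it flips to true. A sketch, with `wait_until_ready` as a hypothetical helper:

```python
import json
import time
from urllib.request import Request, urlopen


def server_ready(user_info, server_name=''):
    """True if the named server reports ready=True.

    `user_info` is the JSON body of GET /hub/api/users/<name>;
    JupyterHub >= 0.9 exposes a `ready` flag per server.
    """
    server = user_info.get('servers', {}).get(server_name)
    return bool(server and server.get('ready'))


def wait_until_ready(hub_api, user, token, timeout=300, poll=2):
    """Poll the hub until the user's default server is ready (hypothetical helper)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        req = Request('{0}/users/{1}'.format(hub_api, user),
                      headers={'Authorization': 'token {0}'.format(token)})
        with urlopen(req) as resp:
            if server_ready(json.load(resp)):
                return True
        time.sleep(poll)
    return False
```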

Session backup

Sometimes notebook instances stop (e.g. the node enters an out-of-memory condition) and current work is permanently lost.

Is there a way to provide some form of backup for these cases?

accept a JWT from trusted source

We should allow the notebook service to consume JWT tokens for authentication - this is a pre-requisite for providing dedicated resources to certain users and/or groups.
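For illustration, HS256 signature verification can be done by hand with the standard library, though a production service should use a vetted library (e.g. PyJWT) and also validate registered claims like `exp`, `aud` and `iss`. A sketch:

```python
import base64
import hashlib
import hmac
import json


def verify_hs256(token, secret):
    """Minimal HS256 JWT check (sketch only).

    Splits the token, recomputes the HMAC-SHA256 signature over
    header.payload with the shared secret (bytes), and returns the
    decoded payload if the signatures match.
    """
    header_b64, payload_b64, sig_b64 = token.split('.')
    signing_input = '{0}.{1}'.format(header_b64, payload_b64).encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    sig = base64.urlsafe_b64decode(sig_b64 + '=' * (-len(sig_b64) % 4))
    if not hmac.compare_digest(expected, sig):
        raise ValueError('invalid signature')
    payload = payload_b64 + '=' * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```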

OOM issue not reported on jupyterlab

On Renkulab: if an operation in a notebook runs out of memory, this is not reported by Jupyterlab. The kernel restarts, but the cell continues to appear to be pending. There is no error message in the notebook itself.

Create a notebooks landing page

The <service-prefix> endpoint right now just gives a JSON of the user object. It should give the authenticated user an overview of the running notebook servers and the means to stop them.

One main issue to resolve is how to actually get the information about running servers per user. Right now this seems to be kept in the user object obtained through the HubOAuthenticator, but it is cached via a mechanism that is still unclear to me. Since the user info is cached, we don't get up-to-date info unless we reduce the caching timeout. There should be some way to query the hub directly, however.

cc/ @ableuler @ciyer

related to SwissDataScienceCenter/renku-ui#151
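Sketch of building the overview from a direct hub query instead of the cached user object -- `servers_overview` is a hypothetical helper that consumes the JSON body of `GET /hub/api/users/<name>`:

```python
def servers_overview(user_info, base_url='/jupyterhub'):
    """Build the landing-page model from the hub's user record.

    Returns one row per running server with its URL and the hub API
    endpoint that a stop button would DELETE (paths illustrative).
    """
    rows = []
    for name, server in user_info.get('servers', {}).items():
        rows.append({
            'name': name or 'default',
            'url': server.get('url'),
            'stop_url': '{0}/hub/api/users/{1}/servers/{2}'.format(
                base_url, user_info['name'], name),
        })
    return rows
```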

400 when trying to access a server that is being terminated

The server information we receive from the REST API does not include server state. If we stop a server, it disappears from the list of the user's servers -- however, it may still be in the process of shutting down. If the user then tries to start it again before it shuts down completely (in k8s, before the pod is completely gone), JH returns a 400 (pod is terminating). We need more up-to-date information about server state to mitigate this.
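A pod that is shutting down still exists but has `metadata.deletionTimestamp` set, so one mitigation is to check for that before asking JH to start the server again. A sketch over the dict form of a k8s Pod object:

```python
def pod_terminating(pod):
    """True if the pod is being deleted (deletionTimestamp set).

    `pod` is the dict form of a Kubernetes Pod object; a pod whose
    metadata.deletionTimestamp is set is shutting down even though it
    still exists, which is when JupyterHub answers a start request
    with a 400.
    """
    return pod.get('metadata', {}).get('deletionTimestamp') is not None
```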

Fix float server options

...
 "resources": {
        "cpu_request": {
            "default": 0.1,
            "displayName": "Number of CPUs",
            "enum": "float",
            "options": [
                0.1,
                0.5,
                1,
                2,
                4,
                8
            ]
        },
...

Note the `enum` key used instead of `type` -- this entry should presumably read `"type": "float"`.

Use symlink, not alias for renku in singleuser

echo "alias renku=$CONDA_DIR/envs/renku/bin/renku" >> /home/$NB_USER/.bashrc

From magic recipe:

conda create -y -n renku python=3.6
$(conda env list | grep renku | awk '{print $2}')/bin/pip install -e git+https://github.com/SwissDataScienceCenter/renku-python.git#egg=renku
mkdir -p ~/.renku/bin
ln -s "$(conda env list | grep renku | awk '{print $2}')/bin/renku" ~/.renku/bin/renku
echo "export PATH=~/.renku/bin:$PATH" >> $HOME/.bashrc
source $HOME/.bashrc
renku --version
which renku

Introduce timeout for pending images.

Avoid infinite loop if build job is stuck.

See:

while True:
    if status == 'success':
        # the image was built
        # it *should* be there so lets use it
        self.image = '{image_registry}' \
                     '/{namespace}' \
                     '/{project}' \
                     ':{commit_sha_7}'.format(
                         image_registry=os.getenv('IMAGE_REGISTRY'),
                         commit_sha_7=commit_sha_7,
                         **options
                     ).lower()
        self.log.info(
            'Using image {image}.'.format(image=self.image)
        )
        break
    elif status in {'failed', 'canceled'}:
        self.log.info(
            'Image build failed for project {0} commit {1} - '
            'using {2} instead'.format(
                project, commit_sha, self.image
            )
        )
        break
    yield gen.sleep(5)
    status = self._get_job_status(pipeline, 'image_build')
    self.log.debug(
        'status of image_build job for commit '
        '{commit_sha_7}: {status}'.format(
            commit_sha_7=commit_sha_7, status=status
        )
    )

was SwissDataScienceCenter/renku#318 reported by @leafty
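A deadline-based variant of the loop above (sketch; `get_status` stands in for `self._get_job_status(pipeline, 'image_build')`, and the return values are illustrative):

```python
import time


def wait_for_build(get_status, timeout=600, poll=5):
    """Poll the image_build job status, giving up after `timeout`
    seconds instead of looping forever on a stuck job."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status == 'success':
            # the image was built and should be usable
            return 'built'
        if status in {'failed', 'canceled'}:
            # build failed; caller falls back to the default image
            return 'fallback'
        time.sleep(poll)
    # job stuck in pending/running past the deadline
    return 'timeout'
```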

Create a better server launch flow

At the moment, we simply issue an API request to jupyterhub for a server spawn and wait until jupyterhub reports that this server is running. This is definitely not the way we want to handle server launches.

We should improve the sequence to provide some extra feedback to the user about what is happening. Most importantly, the notebooks service flask app should not block waiting for the server to spawn. We should discuss how to move this forward -- some options:

  • if the server for <namespace>/<project>/<sha> is not running, the endpoint should return a page that shows the status of the pod/server being launched, with a 202 return code
  • if the server for <namespace>/<project>/<sha> is running, the server/notebook is returned as normal
  • the page with the status should redirect to the running server once it's up
  • we should use k8s/docker clients for checking on the server spawn so we can report errors as they arise. If the pod launch errors, call DELETE on the JH API for the server so that it is immediately removed from the proxy etc.
  • do we need a <namespace>/<project>/<sha>/status endpoint? Or just <namespace>/<project>/<sha>?status? No: use CRUD conventions -- POST launches the notebook, GET gives the status of the launch

Once this is done, it will most likely fix #23
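The proposed GET semantics could be sketched as a small state-to-response mapping (the state names and body shapes here are hypothetical):

```python
def launch_response(state, url=None):
    """Map the server's spawn state to an HTTP response.

    202 with a status body while spawning, the server URL once it is
    running, 404 if no launch is in progress (sketch of the proposed
    non-blocking flow; states are illustrative).
    """
    if state == 'running':
        return 200, {'url': url}
    if state in ('spawning', 'pending'):
        return 202, {'status': state}
    return 404, {'status': 'not running'}
```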

Don't make us wait forever when starting docker/jupyterlab fails.

From @erbou on September 28, 2018 10:22

Is your feature request related to a problem? Please describe.
Starting Jupyterlab will not notify me if something goes wrong, and can make me wait forever.

Describe the solution you'd like
Give the option to get more details (can be summarized, no need for a full log) about what it is doing, and what's left to do, with a brief status (Ok, Fail) of operation, such as:

* [Ok]   starting docker container
* [Fail]  cloning repo
* [    ]   importing data   <- for git submodule update / git lfs pull

Describe alternatives you've considered
At a minimum we should report an error, so that we know when there's no point in continuing to wait.

Copied from original issue: SwissDataScienceCenter/renku-ui#325

Modify `cache-control` header for `server_options` endpoint

Currently, the notebook service sets a max-age value of 12 hours on the cache-control header for the server_options endpoint. I assume that this was not set on purpose, but that it's just a side-effect of serving the json as a static file. I suggest dramatically reducing this value or removing the header altogether.

Return absolute instead of relative URLs

URLs returned by the notebook service api should be absolute, i.e. include the host name.

...
url: "/jupyterhub/user/cramakri/cramakri-weather-2d-9789210/"

should become

url: "https://renkulab.io/jupyterhub/user/cramakri/cramakri-weather-2d-9789210/"

such that the UI can open that URL directly when opening a notebook server tab for the user.
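In Flask this could be as simple as joining the request's host URL with the relative path (here `host_url` is passed in explicitly; in the service it would come from `flask.request.host_url`):

```python
from urllib.parse import urljoin


def absolute_url(host_url, path):
    """Prefix a relative service URL with the deployment host so the
    UI can open it directly (sketch; host_url would come from the
    incoming request in practice)."""
    return urljoin(host_url, path)
```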

Still some uppercase issues

Pod \"jupyter-johann-2Et-johann-2et-newnew-a342109\" is invalid: [metadata.name: Invalid value: \"jupyter-johann-2Et-johann-2et-newnew-a342109\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character

how to remove pending servers

It appears that JH doesn't offer an API for removing spawning servers. We should somehow be able to remove them, however -- e.g. if someone tries to spawn a server and the requested number of GPUs is not available.

Project with upper case is not started properly

The mounted volume (emptyDir) does not respect case at the moment. This results in a lab instance which cannot be used.

Edit: This happens when converting existing projects to Renku (and not when creating them through the UI).

limit the number of simultaneous servers

When a user requests a new server, check how many are already running and if that number is equal to MAX_USER_SERVERS then shut down the oldest one before starting the new one.

Note (@leafty): shut down the oldest from the same (project, user) pair. The idea is that the user is iterating on their Docker image.
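The eviction policy could be sketched as follows, with the user's servers for one (project, user) pair represented as a hypothetical `{name: start_timestamp}` mapping:

```python
def server_to_stop(servers, max_user_servers):
    """Return the name of the oldest server to shut down if starting a
    new one would exceed MAX_USER_SERVERS, else None.

    `servers` maps server name -> start timestamp for one
    (project, user) pair (sketch; representation is illustrative).
    """
    if len(servers) < max_user_servers:
        return None
    return min(servers, key=servers.get)
```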

http 500 while pod is starting

Once the pod is running, the page loads fine. This happened while waiting for the pod to start.

log:

10.36.0.8 - - [10/Jul/2018 06:56:58] "GET /jupyterhub/services/notebooks/demo/test/74871936eed7d586d8034a3ecadd444f369492df?branch=master HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2295, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1741, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/src/notebooks_service.py", line 77, in decorated
    return f(user, *args, **kwargs)
  File "/app/src/notebooks_service.py", line 218, in launch_notebook
    headers=headers
  File "/usr/local/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

use asyncio in launch_notebook

The launch_notebook function is synchronous -- this will eventually lead to a terrible user experience. It should use asyncio.
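A sketch of what a non-blocking launch could look like -- `spawn` and `notify` are hypothetical stand-ins for the JupyterHub call and the user-feedback hook:

```python
import asyncio


async def launch_notebook(spawn, notify):
    """Kick off the spawn and return immediately, letting the event
    loop serve other requests; `notify` is called with the result once
    the spawn completes (sketch, not the actual service code)."""
    task = asyncio.ensure_future(spawn())
    task.add_done_callback(lambda t: notify(t.result()))
    return task
```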

validate server_options

The server_options need validation in several places:

  • in the values.yaml that the admin passes on deployment
  • in the service code where a request with serverOptions in the body is processed

It's not obvious what the validation should consist of, however.

  1. Enforce only a specific set of server options, e.g. resources.cpu_request, resources.mem_request etc.
  2. Make sure that the serverOptions that are passed in through the request conform to the types specified in the values.yaml.
  3. ?

This validation was left as a todo from pr #68.

disable smudge with repo option

This should be set in the repository on checkout:

git lfs install --skip-smudge --local

Otherwise, every git checkout command will try to pull LFS objects leading to general unhappiness.
