
renku-notebooks's Issues

provide appropriate image ref/tag to use for a commit

The notebook service should provide an endpoint that determines which image should be used for a particular commit -- this assumes that we only build images on changes to the files that define the environment. For example, we could use the GitLab CI functionality that limits job triggers to changes in specific paths:

image_build:
  stage: build
  image: docker:stable
  before_script:
    - docker login -u gitlab-ci-token -p $CI_JOB_TOKEN $CI_REGISTRY
  script:
    - CI_COMMIT_SHA_7=$(echo $CI_COMMIT_SHA | cut -c1-7)
    - docker pull renku/singleuser:latest
    - docker build --tag $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7 .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA_7
  only: 
    changes:
      - Dockerfile
      - requirements.txt 
      - envs/*
  tags:
    - image-build

related to #98

use dns-safe server names

Still failing on usernames with non-alphanumeric characters:

{"reason":"FieldValueInvalid","message":"Invalid value: \"renku-jupyter-rok-2erosk-rok_2erosk-proj-85a5229\": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')"

The username was rok.roskar. This issue is continued from SwissDataScienceCenter/renku#252.
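A sketch of one way to make such usernames DNS-safe: lowercase, substitute disallowed characters, and disambiguate with a short hash so distinct usernames cannot collapse to the same label. The `dns_safe` helper and hashing scheme here are illustrative, not the escaping Renku actually uses:

```python
import hashlib
import re


def dns_safe(name, max_length=63):
    """Turn an arbitrary username into a DNS-1123-safe label.

    Lowercases the input, replaces every disallowed character with '-',
    trims leading/trailing '-', and appends a short hash so that two
    usernames that collapse to the same label stay distinct.
    """
    safe = re.sub(r'[^a-z0-9-]', '-', name.lower()).strip('-')
    suffix = hashlib.md5(name.encode('utf-8')).hexdigest()[:5]
    return '{0}-{1}'.format(safe[:max_length - 6], suffix)


dns_safe('rok.roskar')  # 'rok-roskar-' plus a 5-character hash
```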

should wait for server to be ready before redirecting

If the notebook server takes a long time to spawn (e.g. because images have to be downloaded), JH waits for 30 seconds and then redirects to a URL like .../hub/user/..., which fails with a 401: it is not actually a valid URL, but it still gets proxied to the notebooks service for some reason, which then queries gitlab for a non-existent project, hence the 401.

The way to fix this is to wait until the server is ready before redirecting. It's unclear how to get this information.

moved from SwissDataScienceCenter/renku#210 -- see SwissDataScienceCenter/renku#210 (comment)
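One possible approach: poll the JupyterHub REST API, which (in JupyterHub >= 0.9) reports a `ready` flag per server, and only redirect once it flips to true. A sketch, with `wait_until_ready` as a hypothetical helper:

```python
import json
import time
from urllib.request import Request, urlopen


def server_ready(user_info, server_name=''):
    """True if the named server reports ready=True.

    `user_info` is the JSON body of GET /hub/api/users/<name>;
    JupyterHub >= 0.9 exposes a `ready` flag per server.
    """
    server = user_info.get('servers', {}).get(server_name)
    return bool(server and server.get('ready'))


def wait_until_ready(hub_api, user, token, timeout=300, poll=2):
    """Poll the hub until the user's default server is ready (hypothetical helper)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        req = Request('{0}/users/{1}'.format(hub_api, user),
                      headers={'Authorization': 'token {0}'.format(token)})
        with urlopen(req) as resp:
            if server_ready(json.load(resp)):
                return True
        time.sleep(poll)
    return False
```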

Session backup

Sometimes notebook instances stop (e.g. the node enters an out-of-memory condition) and current work is permanently lost.

Is there a way to provide some form of backup for these cases?

accept a JWT from trusted source

We should allow the notebook service to consume JWT tokens for authentication - this is a pre-requisite for providing dedicated resources to certain users and/or groups.
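For illustration, HS256 signature verification can be done by hand with the standard library, though a production service should use a vetted library (e.g. PyJWT) and also validate registered claims like `exp`, `aud` and `iss`. A sketch:

```python
import base64
import hashlib
import hmac
import json


def verify_hs256(token, secret):
    """Minimal HS256 JWT check (sketch only).

    Splits the token, recomputes the HMAC-SHA256 signature over
    header.payload with the shared secret (bytes), and returns the
    decoded payload if the signatures match.
    """
    header_b64, payload_b64, sig_b64 = token.split('.')
    signing_input = '{0}.{1}'.format(header_b64, payload_b64).encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    sig = base64.urlsafe_b64decode(sig_b64 + '=' * (-len(sig_b64) % 4))
    if not hmac.compare_digest(expected, sig):
        raise ValueError('invalid signature')
    payload = payload_b64 + '=' * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```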

OOM issue not reported on jupyterlab

On Renkulab: if an operation in a notebook runs out of memory, this is not reported by Jupyterlab. The kernel restarts, but the cell continues to appear to be pending. There is no error message in the notebook itself.

Create a notebooks landing page

The <service-prefix> endpoint right now just gives a JSON of the user object. It should give the authenticated user an overview of the running notebook servers and the means to stop them.

One main issue to resolve is how to actually get the information about running servers per user. Right now this seems to be kept in the user object obtained through the HubOAuthenticator, but it is cached via a mechanism that is still unclear to me. Since the user info is cached, we don't get up-to-date info unless we reduce the caching timeout. There should be some way to query the hub directly, however.

cc/ @ableuler @ciyer

related to SwissDataScienceCenter/renku-ui#151
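Sketch of building the overview from a direct hub query instead of the cached user object -- `servers_overview` is a hypothetical helper that consumes the JSON body of `GET /hub/api/users/<name>`:

```python
def servers_overview(user_info, base_url='/jupyterhub'):
    """Build the landing-page model from the hub's user record.

    Returns one row per running server with its URL and the hub API
    endpoint that a stop button would DELETE (paths illustrative).
    """
    rows = []
    for name, server in user_info.get('servers', {}).items():
        rows.append({
            'name': name or 'default',
            'url': server.get('url'),
            'stop_url': '{0}/hub/api/users/{1}/servers/{2}'.format(
                base_url, user_info['name'], name),
        })
    return rows
```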

400 when trying to access a server that is being terminated

The server information we receive from the REST API does not include server state. If we stop a server, it disappears from the list of the user's servers -- however, it may still be in the process of shutting down. If the user then tries to start it again before it shuts down completely (in k8s, before the pod is completely gone), JH returns a 400 (pod is terminating). We need more up-to-date information about server state to mitigate this.
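A pod that is shutting down still exists but has `metadata.deletionTimestamp` set, so one mitigation is to check for that before asking JH to start the server again. A sketch over the dict form of a k8s Pod object:

```python
def pod_terminating(pod):
    """True if the pod is being deleted (deletionTimestamp set).

    `pod` is the dict form of a Kubernetes Pod object; a pod whose
    metadata.deletionTimestamp is set is shutting down even though it
    still exists, which is when JupyterHub answers a start request
    with a 400.
    """
    return pod.get('metadata', {}).get('deletionTimestamp') is not None
```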

Fix float server options

...
 "resources": {
        "cpu_request": {
            "default": 0.1,
            "displayName": "Number of CPUs",
            "enum": "float",
            "options": [
                0.1,
                0.5,
                1,
                2,
                4,
                8
            ]
        },
...

Note the `enum` key used instead of `type` -- this entry should presumably read `"type": "float"`.

Use symlink, not alias for renku in singleuser

echo "alias renku=$CONDA_DIR/envs/renku/bin/renku" >> /home/$NB_USER/.bashrc

From magic recipe:

conda create -y -n renku python=3.6
$(conda env list | grep renku | awk '{print $2}')/bin/pip install -e git+https://github.com/SwissDataScienceCenter/renku-python.git#egg=renku
mkdir -p ~/.renku/bin
ln -s "$(conda env list | grep renku | awk '{print $2}')/bin/renku" ~/.renku/bin/renku
echo "export PATH=~/.renku/bin:$PATH" >> $HOME/.bashrc
source $HOME/.bashrc
renku --version
which renku

Introduce timeout for pending images.

Avoid infinite loop if build job is stuck.

See:

while True:
    if status == 'success':
        # the image was built
        # it *should* be there so lets use it
        self.image = '{image_registry}' \
                     '/{namespace}' \
                     '/{project}' \
                     ':{commit_sha_7}'.format(
                         image_registry=os.getenv('IMAGE_REGISTRY'),
                         commit_sha_7=commit_sha_7,
                         **options
                     ).lower()
        self.log.info(
            'Using image {image}.'.format(image=self.image)
        )
        break
    elif status in {'failed', 'canceled'}:
        self.log.info(
            'Image build failed for project {0} commit {1} - '
            'using {2} instead'.format(
                project, commit_sha, self.image
            )
        )
        break
    yield gen.sleep(5)
    status = self._get_job_status(pipeline, 'image_build')
    self.log.debug(
        'status of image_build job for commit '
        '{commit_sha_7}: {status}'.format(
            commit_sha_7=commit_sha_7, status=status
        )
    )

was SwissDataScienceCenter/renku#318 reported by @leafty
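A deadline-based variant of the loop above (sketch; `get_status` stands in for `self._get_job_status(pipeline, 'image_build')`, and the return values are illustrative):

```python
import time


def wait_for_build(get_status, timeout=600, poll=5):
    """Poll the image_build job status, giving up after `timeout`
    seconds instead of looping forever on a stuck job."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status == 'success':
            # the image was built and should be usable
            return 'built'
        if status in {'failed', 'canceled'}:
            # build failed; caller falls back to the default image
            return 'fallback'
        time.sleep(poll)
    # job stuck in pending/running past the deadline
    return 'timeout'
```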

Create a better server launch flow

At the moment, we simply issue an API request to jupyterhub for a server spawn and wait until jupyterhub reports that this server is running. This is definitely not the way we want to handle server launches.

We should improve the sequence to provide some extra feedback to the user about what is happening. Most importantly, the notebooks service flask app should not block waiting for the server to spawn. We should discuss how to move this forward -- some options:

  • if the server for <namespace>/<project>/<sha> is not running, the endpoint should return a page that shows the status of the pod/server being launched, with a 202 return code
  • if the server for <namespace>/<project>/<sha> is running, the server/notebook is returned as normal
  • the page with the status should redirect to the running server once it's up
  • we should use k8s/docker clients for checking on the server spawn so we can report errors as they arise. If the pod launch errors, call DELETE on the JH API for the server so that it is immediately removed from the proxy etc.
  • do we need a <namespace>/<project>/<sha>/status endpoint? Or just <namespace>/<project>/<sha>?status? No: use CRUD conventions -- POST launches the notebook, GET gives the status of the launch

Once this is done, it will most likely fix #23
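The proposed GET semantics could be sketched as a small state-to-response mapping (the state names and body shapes here are hypothetical):

```python
def launch_response(state, url=None):
    """Map the server's spawn state to an HTTP response.

    202 with a status body while spawning, the server URL once it is
    running, 404 if no launch is in progress (sketch of the proposed
    non-blocking flow; states are illustrative).
    """
    if state == 'running':
        return 200, {'url': url}
    if state in ('spawning', 'pending'):
        return 202, {'status': state}
    return 404, {'status': 'not running'}
```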

Don't make us wait forever when starting docker/jupyterlab fails.

From @erbou on September 28, 2018 10:22

Is your feature request related to a problem? Please describe.
Starting Jupyterlab will not notify me if something goes wrong, and can make me wait forever.

Describe the solution you'd like
Give the option to get more details (can be summarized, no need for a full log) about what it is doing, and what's left to do, with a brief status (Ok, Fail) of operation, such as:

* [Ok]   starting docker container
* [Fail]  cloning repo
* [    ]   importing data   <- for git submodule update / git lfs pull

Describe alternatives you've considered
At a minimum we should report an error, so that we know when there's no point in continuing to wait.

Copied from original issue: SwissDataScienceCenter/renku-ui#325

Modify `cache-control` header for `server_options` endpoint

Currently, the notebook service sets a max-age value of 12 hours on the cache-control header for the server_options endpoint. I assume that this was not set on purpose, but that it's just a side-effect of serving the json as a static file. I suggest dramatically reducing this value or removing the header altogether.

Return absolute instead of relative URLs

URLs returned by the notebook service api should be absolute, i.e. include the host name.

...
url: "/jupyterhub/user/cramakri/cramakri-weather-2d-9789210/"

should become

url: "https://renkulab.io/jupyterhub/user/cramakri/cramakri-weather-2d-9789210/"

such that the UI can open that URL directly when opening a notebook server tab for the user.
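In Flask this could be as simple as joining the request's host URL with the relative path (here `host_url` is passed in explicitly; in the service it would come from `flask.request.host_url`):

```python
from urllib.parse import urljoin


def absolute_url(host_url, path):
    """Prefix a relative service URL with the deployment host so the
    UI can open it directly (sketch; host_url would come from the
    incoming request in practice)."""
    return urljoin(host_url, path)
```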

Still some uppercase issues

Pod \"jupyter-johann-2Et-johann-2et-newnew-a342109\" is invalid: [metadata.name: Invalid value: \"jupyter-johann-2Et-johann-2et-newnew-a342109\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character

how to remove pending servers

It appears that JH doesn't offer an API for removing spawning servers. We should somehow be able to remove them, however -- e.g. if someone tries to spawn a server and the requested number of GPUs is not available.

Project with upper case is not started properly

The mounted volume (emptyDir) does not respect case at the moment. This results in a lab instance which cannot be used.

Edit: This happens when converting existing projects to Renku (and not when creating them through the UI).

limit the number of simultaneous servers

When a user requests a new server, check how many are already running and if that number is equal to MAX_USER_SERVERS then shut down the oldest one before starting the new one.

Note (@leafty): shut down the oldest from the same (project, user) pair. The idea is that the user is iterating on their Docker image.
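The eviction policy could be sketched as follows, with the user's servers for one (project, user) pair represented as a hypothetical `{name: start_timestamp}` mapping:

```python
def server_to_stop(servers, max_user_servers):
    """Return the name of the oldest server to shut down if starting a
    new one would exceed MAX_USER_SERVERS, else None.

    `servers` maps server name -> start timestamp for one
    (project, user) pair (sketch; representation is illustrative).
    """
    if len(servers) < max_user_servers:
        return None
    return min(servers, key=servers.get)
```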

http 500 while pod is starting

Once the pod is running, the page loads fine. This happened while waiting for the pod to start.

log:

10.36.0.8 - - [10/Jul/2018 06:56:58] "GET /jupyterhub/services/notebooks/demo/test/74871936eed7d586d8034a3ecadd444f369492df?branch=master HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2295, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1741, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/src/notebooks_service.py", line 77, in decorated
    return f(user, *args, **kwargs)
  File "/app/src/notebooks_service.py", line 218, in launch_notebook
    headers=headers
  File "/usr/local/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

use asyncio in launch_notebook

The launch_notebook function is synchronous -- this will eventually lead to a terrible user experience. It should use asyncio.
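A sketch of what a non-blocking launch could look like -- `spawn` and `notify` are hypothetical stand-ins for the JupyterHub call and the user-feedback hook:

```python
import asyncio


async def launch_notebook(spawn, notify):
    """Kick off the spawn and return immediately, letting the event
    loop serve other requests; `notify` is called with the result once
    the spawn completes (sketch, not the actual service code)."""
    task = asyncio.ensure_future(spawn())
    task.add_done_callback(lambda t: notify(t.result()))
    return task
```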

validate server_options

The server_options need validation in several places:

  • in the values.yaml that the admin passes on deployment
  • in the service code where a request with serverOptions in the body is processed

It's not obvious what the validation should consist of, however.

  1. Enforce only a specific set of server options, e.g. resources.cpu_request, resources.mem_request etc.
  2. Make sure that the serverOptions that are passed in through the request conform to the types specified in the values.yaml.
  3. ?

This validation was left as a todo from pr #68.

disable smudge with repo option

This should be set in the repository on checkout:

git lfs install --skip-smudge --local

Otherwise, every git checkout command will try to pull LFS objects leading to general unhappiness.
