swissdatasciencecenter / renku-python

A Python library for the Renku collaborative data science platform.

Home Page: https://renku-python.readthedocs.io/

License: Apache License 2.0

Python 99.08% Shell 0.30% Makefile 0.08% Dockerfile 0.08% JavaScript 0.47%

renku-python's Introduction

Renku Python Library, CLI and Service

A Python library for the Renku collaborative data science platform. It includes a CLI and SDK for end-users as well as a service backend. It provides functionality for the creation and management of projects and datasets, and simple utilities to capture data provenance while performing analysis tasks.

Note: renku-python is the Python library and core service for Renku. It does not start the Renku platform itself; for that, refer to the Renku docs on running the platform.

Renku for Users

Installation

Renku releases and development versions are available from PyPI. You can install it using any tool that knows how to handle PyPI packages. Our recommendation is to use `pipx`.

Note

We do not officially support Windows at this moment. The way Windows handles paths and symlinks interferes with some Renku functionality. We recommend using the Windows Subsystem for Linux (WSL) to use Renku on Windows.

Prerequisites

Renku depends on Git under the hood, so make sure that you have Git installed on your system.

Renku can store large files in Git LFS. Git LFS is used by default, so it should be installed on your system. If you do not wish to use Git LFS, you can run Renku commands with the -S flag, as in renku -S <command>. More information on Git LFS usage in Renku can be found in the Data in Renku section of the docs.

Renku uses CWL to execute recorded workflows when calling renku update or renku rerun. CWL depends on Node.js to execute the workflows, so installing Node.js is required if you want to use those features.

For development of the service, Docker is recommended.
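As a quick way to verify the prerequisites above before installing, the following minimal sketch checks which tools are on your PATH. The executable names are assumptions (git, git-lfs and node are the usual binary names, but some distributions ship Node.js as nodejs), and missing_tools is a hypothetical helper, not part of Renku.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names; adjust them if your system uses different ones.
REQUIRED = ("git", "git-lfs")  # git-lfs can be skipped if you always pass -S
OPTIONAL = ("node",)           # only needed for `renku update` / `renku rerun`


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if which(name) is None]


# Report anything that is missing; prints nothing when all tools are found.
for tool in missing_tools(REQUIRED):
    print(f"missing required tool: {tool}")
for tool in missing_tools(OPTIONAL):
    print(f"missing optional tool: {tool}")
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is sufficient.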

`pipx`

First, install pipx and make sure that the $PATH is correctly configured.

$ python3 -m pip install --user pipx
$ python3 -m pipx ensurepath

Once pipx is installed, use the following commands to install renku and check where it was installed.

$ pipx install renku
$ which renku
~/.local/bin/renku

pipx installs Renku into its own virtual environment, making sure that it does not pollute any other packages or versions that you may have already installed.

Note

If you install Renku as a dependency in a virtual environment and the environment is active, your shell will default to the version installed in the virtual environment, not the version installed by pipx.

To install a development release:

$ pipx install --pip-args=--pre renku

`pip`

$ pip install renku

The latest development versions are available on PyPI or from the Git repository:

$ pip install --pre renku
# - OR -
$ pip install -e git+https://github.com/SwissDataScienceCenter/renku-python.git#egg=renku

Use the following installation steps, based on your operating system and preferences, if you would like to work with the command line interface and you do not need the Python library to be importable.

Windows

Note

We don't officially support Windows yet, but Renku works well in the Windows Subsystem for Linux (WSL). As such, the following can be regarded as a best effort description on how to get started with Renku on Windows.

Renku can be run using the Windows Subsystem for Linux (WSL). To install the WSL, please follow the official instructions.

We recommend you use the Ubuntu 20.04 image in the WSL when you get to that step of the installation.

Once WSL is installed, launch the WSL terminal and install the packages required by Renku with:

$ sudo apt-get update && sudo apt-get install git python3 python3-pip python3-venv pipx

Since Ubuntu has an older version of git LFS installed by default which is known to have some bugs when cloning repositories, we recommend you manually install the newest version by following these instructions.

Once all the requirements are installed, you can install Renku normally by running:

$ pipx install renku
$ pipx ensurepath

After this, Renku is ready to use. You can access your Windows files through the mount points under /mnt/, and you can execute Windows executables (e.g. *.exe) as usual directly from the WSL (so renku run myexecutable.exe will work as expected).

Docker

The containerized version of the CLI can be launched using the following Docker command.

$ docker run -it -v "$PWD":"$PWD" -w="$PWD" renku/renku-python renku

It makes sure your current directory is mounted to the same place in the container.
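If you want to launch the containerized CLI from a script, the same invocation can be assembled programmatically. This is only a sketch: docker_renku_argv is a hypothetical helper that mirrors the command above, and the resulting list still has to be passed to something like subprocess.run.

```python
import shlex
from typing import List, Sequence


def docker_renku_argv(workdir: str, renku_args: Sequence[str]) -> List[str]:
    """Build the argument list for running renku inside the container.

    The working directory is mounted at the same path inside the
    container, as in the shell command shown above.
    """
    return [
        "docker", "run", "-it",
        "-v", f"{workdir}:{workdir}",
        f"-w={workdir}",
        "renku/renku-python", "renku", *renku_args,
    ]


# Show the command line that would be executed for `renku status`.
print(shlex.join(docker_renku_argv("/home/user/my-renku-project", ["status"])))
```

Building the arguments as a list avoids shell quoting problems when the project path contains spaces.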

CLI Example

Initialize a Renku project:

$ mkdir -p ~/temp/my-renku-project
$ cd ~/temp/my-renku-project
$ renku init

Create a dataset and add data to it:

$ renku dataset create my-dataset
$ renku dataset add my-dataset https://raw.githubusercontent.com/SwissDataScienceCenter/renku-python/master/README.rst

Run an analysis:

$ renku run --name my-workflow -- wc < data/my-dataset/README.rst > wc_readme

Trace the data provenance:

$ renku workflow visualize wc_readme

These are the basics, but there is much more that Renku allows you to do with your data analysis workflows. The full documentation will soon be available at: https://renku-python.readthedocs.io/

Renku as a Service

This repository includes a renku-core RPC service written as a Flask application that provides (almost) all of the functionality of the Renku CLI. This is used to provide one of the backends for the RenkuLab web UI. The service can be deployed in production as a Helm chart (see helm-chart).

Deploying locally

To test the service functionality, you can deploy it quickly and easily using docker-compose up (see [docker-compose](https://pypi.org/project/docker-compose/)). Make sure to make a copy of the renku/service/.env-example file and configure it to your needs. The setup here exposes the service behind a traefik reverse proxy to mimic an actual production deployment. You can access the proxied endpoints at http://localhost/api. The service itself is exposed on port 8080, so its endpoints are also available directly under http://localhost:8080.

API Documentation

The Renku core service documents its API as an OpenAPI 3.0.x specification. You can retrieve the YAML of the specification itself with:

$ renku service apispec
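Once the specification has been parsed into a dictionary (for example with a YAML library such as PyYAML, which is an assumption and not something this README prescribes), a short sketch like the following lists the operations it defines. It relies only on the standard OpenAPI 3.0 layout, where paths maps each path to an object keyed by HTTP method; list_operations and the sample paths are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# HTTP methods that may appear as keys of an OpenAPI 3.0 path item.
HTTP_METHODS = ("get", "put", "post", "delete", "options", "head", "patch", "trace")


def list_operations(spec: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (METHOD, path) pairs from a parsed OpenAPI document.

    The result is sorted by path, then by method name.
    """
    operations = []
    for path, item in spec.get("paths", {}).items():
        for method in HTTP_METHODS:
            if method in item:
                operations.append((method.upper(), path))
    return sorted(operations, key=lambda op: (op[1], op[0]))


# Made-up example document; the real paths come from `renku service apispec`.
example = {"paths": {"/b": {"post": {}}, "/a": {"get": {}, "delete": {}}}}
for method, path in list_operations(example):
    print(method, path)
```

Keys of a path item that are not HTTP methods, such as parameters or summary, are ignored.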

If deploying the service locally with docker-compose you can find the swagger-UI under localhost/api/swagger. To send the proper authorization headers to the service endpoints, click the Authorize button and enter a valid JWT token and a gitlab token with read/write repository scopes. The JWT token can be obtained by logging in to a renku instance with renku login and retrieving it from your local renku configuration.

In a live deployment, the swagger documentation is available under https://<renku-endpoint>/swagger. You can authorize the API by first logging into renku normally, then going to the swagger page, clicking Authorize and picking the oidc (OAuth2, authorization_code) option. Leave the client_id as swagger and the client_secret empty, select all scopes and click Authorize. You should now be logged in and you can send requests using the Try it out buttons on individual requests.

Developing Renku

For testing the functionality from source it is convenient to install renku in editable mode using pipx. Clone the repository and then do:

$ pipx install \
    --editable \
    <path-to-renku-python>[all] \
    renku

This will install all the extras for testing and debugging.

If you already use pyenv to manage different python versions, you may be interested in installing pyenv-virtualenv to create an ad-hoc virtual environment for developing renku.

Once you have created and activated a virtual environment for renku-python, you can use the usual pip commands to install the required dependencies.

$ pip install -e .[all]  # use ".[all]" for zsh

Service

Developing the service and testing its APIs can be done with docker-compose (see "Deploying locally" above).

If you have a full RenkuLab deployment at your disposal, you can use telepresence v1 to develop and debug locally. Just run the start-telepresence.sh script and follow the instructions. Mind that the script doesn't work with telepresence v2.

Running tests

We use pytest for running tests. You can use our run-tests.sh script to run a specific set of tests.

$ ./run-tests.sh -h

We lint the files using black and isort.

Using External Debuggers

Local Machine

To run renku via e.g. the Visual Studio Code debugger, you need to run it via the python executable in whatever virtual environment was used to install renku. If the debugger needs an additional package, you need to inject it into the virtual environment first, e.g.:

$ pipx inject renku ptvsd

Finally, run renku via the debugger:

$ ~/.local/pipx/venvs/renku/bin/python -m ptvsd --host localhost --wait -m renku.ui.cli <command>

If using Visual Studio Code, you may also want to set pathMappings in the Remote Attach configuration so that it will find your source code, e.g.

{
    "name": "Python: Remote Attach",
    "type": "python",
    "request": "attach",
    "port": 5678,
    "host": "localhost",
    "pathMappings": [
        {
            "localRoot": "<path-to-renku-python-source-code>",
            "remoteRoot": "<path-to-renku-python-source-code>"
        }
    ]
}

renku-python's Issues

`default_bucket` used in `add` when `autosync=true` but it is not defined at `init` time

In other words, the default bucket is not created by the cli at project creation time

provide a template LFS configuration for the newly initialized project

~~- [ ] renga init should also create a .gitattributes file that will sync the data/ directory.~~

cli functions like renga run should add appropriate paths to git lfs for tracking
- renga run invocations should add the outputs to .gitattributes by default -- provide a flag to override the behavior (done in #116)
- renga datasets add should add all files except for metadata.json

addresses SwissDataScienceCenter/renku#117

add option to specify a notebook image

Simple inputs and outputs

Define inputs and outputs and labels of a context:

labels:
  - renga.contexts.inputs.<name>(=[<bucket_id>,<file_id>])

relates to SwissDataScienceCenter/renga-deployer#46

return state and creation time in CLI listings

CLI: add project vertex_id to http headers

currently the project vertex_id is passed as a label -- move it to headers to have the same behavior as storage

done in #31

metadata: properly serialize nested JSON-LD-aware classes

Implement following helper attributes builders: jsonld.container.(set|list|index)
- example usage: jsonld.container.list(Author)

Options

a) build a single @context from all nested objects
b) include @context in each nested object + implement custom loader?

extracting submodule history fails if not modified by renga

Sequence to reproduce:

mkdir foo
mkdir bar
cd foo
renga init
echo woop > ../woop
renga datasets add dataset ../woop
cd ../bar
renga init
renga datasets create dataset
renga datasets add dataset ../foo/data/dataset/woop
renga run wc data/dataset/foo/data/dataset/woop > woop.wc
cd ../foo
echo woop2 > data/dataset/woop
git commit -am 'commiting changes to woop'
cd ../bar
git submodule update --rebase --remote
git commit -am 'update submodule'
renga status

On branch master
Files generated from outdated inputs:
  (use "renga log <file>..." to see the full lineage)

	woop.wc:

Normally it should display: <name_of_submodule>@<commit_sha1>.

Renga init with endpoint does not save it

Scenario:

renga init --endpoint https://example.com --autosync

Then the following is not working:

renga contexts list

I can fix by changing .renga/config.yml from:

core:
  autosync: true
  generated: '2017-10-25T11:36:10.689296'
  name: MyProject
endpoints:
  https://example.com:
    vertex_id: '20688'

to:

core:
  autosync: true
  generated: '2017-10-25T11:36:10.689296'
  name: MyProject
  default: https://example.com
endpoints:
  https://example.com:
    vertex_id: '20688'

but it would be nice to have it inferred from the init command.

Better errors + cleanup when `renga init` is executed in an already existing git repo

If you run renga init in an already existing git repo, the error is confusing and gives no hint about the -f flag.
Also, it does not roll back or clean up the .renga folder it created in the process, which causes an error when you later try renga init -f.

If you run renga init in an already existing renga repo, the error is confusing as you get a FileExistsError.

Implement `renga status` that checks the validity of data provenance

all CWL outputs should be produced from latest version of inputs

--
addresses SwissDataScienceCenter/renku#117

Publish docs

include autodoc for clients and cli
enable hook for Read the Docs

improve docs to reflect recent changes and new functionality

addresses #62

from_config doesn't work as advertised

In [4]: client = renga.from_config(endpoint='http://localhost')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-b511b07226e7> in <module>()
----> 1 client = renga.from_config(endpoint='http://localhost')

~/Projects/renga-python/renga/cli/_client.py in from_config(config, endpoint)
     31     """
     32     if config is None:
---> 33         config = read_config()
     34         project_config_path = get_project_config_path()
     35         if project_config_path:

TypeError: read_config() missing 1 required positional argument: 'path'
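The traceback shows read_config() being called without its required path argument. One possible fix, sketched below with hypothetical stand-ins for both functions, is to give path a default. The real signatures, the default location and the returned objects in renga may well differ; DEFAULT_CONFIG_PATH in particular is a placeholder.

```python
from typing import Any, Dict, Optional

# Placeholder value; the real default location is not stated in this issue.
DEFAULT_CONFIG_PATH = "config.yml"


def read_config(path: Optional[str] = None) -> Dict[str, Any]:
    """Stand-in loader that falls back to a default path when none is given."""
    return {"path": path or DEFAULT_CONFIG_PATH}


def from_config(
    config: Optional[Dict[str, Any]] = None,
    endpoint: Optional[str] = None,
) -> Dict[str, Any]:
    """Stand-in for the failing function; returns the resolved settings."""
    if config is None:
        # With the default in place this call no longer raises TypeError.
        config = read_config()
    if endpoint is not None:
        config = {**config, "endpoint": endpoint}
    return config


print(from_config(endpoint="http://localhost"))
```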

create a jupyter notebook docker image with renga-python pre-installed

dataset: recognize github urls and parse /tree/<branch-name>

we should recognize /tree/<branch-name> and checkout the branch
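A minimal sketch of the URL handling this issue asks for, assuming URLs of the form https://github.com/<owner>/<repo>/tree/<branch-name>. The function name is hypothetical. Note that such URLs are ambiguous when a subdirectory follows the branch name, because branch names may themselves contain slashes; this sketch treats everything after /tree/ as the branch.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_github_tree_url(url: str) -> Tuple[str, Optional[str]]:
    """Split a GitHub URL into (repository URL, branch name or None)."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    # Expected shape: <owner>/<repo>[/tree/<branch-name>]
    if len(segments) < 2:
        raise ValueError(f"not a repository URL: {url}")
    repo_url = f"{parts.scheme}://{parts.netloc}/{segments[0]}/{segments[1]}"
    branch = None
    if len(segments) > 3 and segments[2] == "tree":
        # Keep everything after "tree" because branch names may contain "/".
        branch = "/".join(segments[3:])
    return repo_url, branch


print(split_github_tree_url(
    "https://github.com/SwissDataScienceCenter/renku-python/tree/master"
))
```

The returned branch could then be passed to the checkout step, while a result of None means the default branch should be used.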

SDK: Read/write data with client

implement repr or str on models

Now: <renga.models.deployer.Context at 0x7f54d419eb00>
Expected: Context(id=foobar, ...) on all models
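One way to get the expected output on all models is a shared mixin that builds the representation from the instance attributes. This is a sketch only: ReprMixin is hypothetical, and the Context class below is a stand-in with made-up fields, not the real renga.models.deployer.Context.

```python
class ReprMixin:
    """Provide ``ClassName(field=value, ...)`` for any model that inherits it."""

    def __repr__(self) -> str:
        # vars() preserves the order in which attributes were assigned.
        fields = ", ".join(f"{key}={value!r}" for key, value in vars(self).items())
        return f"{type(self).__name__}({fields})"


class Context(ReprMixin):
    """Hypothetical stand-in model with two illustrative attributes."""

    def __init__(self, id: str, state: str) -> None:
        self.id = id
        self.state = state


print(repr(Context("foobar", "running")))
```

Because the mixin only defines __repr__, str() falls back to the same text, so both calls give a readable result.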

authentication: how to ensure the deployer-spawned session remains authenticated

Currently we pass an access token to the deployed container, which will expire. One solution could be to simply send a refresh token instead of the access token, but this has security implications. We need to solve this problem either by having a flow with authentication via the python client or by some other means.

possibility to upload/download files from CLI

like for example (it can be different):

renga io buckets upload BUCKET_ID FILENAME

renga io download FILE_ID > FILENAME

define schemas for configs

Re-run CWL steps on the hosted platform

use the CI pipeline functionality of gitlab to rerun steps
create custom images for rerunning renga-generated steps
create a .gitlab-ci.yml automatically in each renga repo

Import datasets from renku-aware repos

Importing from a git repository that contains a .renku directory should automatically reuse the included metadata about authors/creators of various entities.

remove the local filesystem path (privacy issues)
reference the original dataset metadata file: $ref: ...
use submodule index to iterate over files when importing from a Git repo

addresses SwissDataScienceCenter/renku#135

cli: rerun with different parameters

enable this:

$ renga runner rerun --job job.yml .renga/workflow/step12345.cwl

create python api skeleton

Basic python API that can be used inside e.g. a jupyter notebook to interact with the platform.

instruct the user of changes to git repo if initializing renku in an existing one

renku init has some side effects if run in an existing git repo (inits lfs, for example). The user should be made aware of these changes and also instructed on how to bring the existing repo in-line with what we expect (e.g. command to run to add existing files to git-lfs).

Fix bugs with provenance calculation

fix DAG display when the same file is used as input in multiple steps

--
addresses SwissDataScienceCenter/renku#117

Resolve lineage in submodules

addresses SwissDataScienceCenter/renku#117

Packaging: PyPI

add renga-python to PyPI

create basic CLI

Need a CLI that allows for this workflow at a minimum:

$ renga login

guides the user through obtaining an offline token
set up the platform access points
create a ~/.renga.conf file that stores user settings and tokens

$ renga init <project>

initialize a project, including adding a node to the KG
creates a .renga metadata file for project-specific configuration

$ renga add

add code and/or data from KG
from git repo
from URL

renga notebook

launch a notebook, mounting . in the notebook container and setting it up with the proper environment for interacting with the platform

notebooks: recreate context when id from config is not found

$ make start
$ renga login
$ renga init --autosync
$ renga notebooks launch
$ make stop
$ docker-compose rm
$ make start
$ renga login
$ renga notebooks launch
# ERROR

Construct workflows from steps

link together several steps to form a workflow

construct a CWL workflow from a file's provenance graph
resolve dependency paths and save workflow to disk for reuse

trying to access a nonexistent context should return a 404

inside a renga-deployed notebook:

import renga
client = renga.from_env()
client.contexts[0]

--> HTTP 500

from the logs:

sqlalchemy.exc.StatementError: (builtins.ValueError) bytes is not a 16-char string 
[SQL: 'SELECT contexts.created AS contexts_created, contexts.updated AS contexts_updated, contexts.id AS contexts_id, contexts.spec AS contexts_spec, contexts.jwt AS contexts_jwt 
FROM contexts 
WHERE contexts.id = %(param_1)s'] [parameters: [{'%(140547263475216 param)s': '0'}]]

create notebook images with renga installed

make several general notebook images with renga pre-installed based on https://hub.docker.com/r/jupyter/

workflows: fix serialisation of File and Path

renga workflow create <FILE> generates invalid CWL due to wrong YAML serialisation.

include JSON-LD context with all metadata

include contexts with:

project
~~CWL steps/workflows~~
~~.renga metadata~~ (is included in project)
datasets

nested serialization to be handled via #119

fix renga notebook cli

renga notebook fails
renga notebook show doesn't show the running notebooks
implement renga notebook stop
implement numbering of running notebooks for easier navigation?
implement automatic opening of chosen notebook in browser

will be done in #31

link notebook code to an execution

We need a way to link notebook code in the graph

enable launching of a specific notebook

allow e.g. the following:

$ renga runner notebook --notebook-path notebooks/my_notebook.ipynb

client = renga.from_env()

authors are removed after calling add (related to #119)
files/<NAME>/path is not serialized as str but pathlib.PosixPath
check how the target is joined to origin path (//)
warn when importing local git repository
adding a specific file without using -t doesn't work

CLI: dev documentation

renga io buckets throws an error

$ renga io buckets

Traceback (most recent call last):
  File "/Users/rok/.virtualenvs/renga/bin/renga", line 11, in <module>
    load_entry_point('renga', 'console_scripts', 'renga')()
  File "/Users/rok/.virtualenvs/renga/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/rok/.virtualenvs/renga/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/rok/.virtualenvs/renga/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/rok/.virtualenvs/renga/lib/python3.6/site-packages/click/core.py", line 1064, in invoke
    sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)
  File "/Users/rok/.virtualenvs/renga/lib/python3.6/site-packages/click/core.py", line 621, in make_context
    self.parse_args(ctx, args)
  File "/Users/rok/Projects/renga-python/renga/cli/_group.py", line 28, in parse_args
    if args[0] in self.commands:
IndexError: list index out of range
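The traceback points at args[0] being evaluated when no arguments follow the group name, so args is empty. A guard along these lines would avoid the IndexError. The helper is a hypothetical, standalone illustration of the check; what the real parse_args should do when no subcommand is given is not covered here.

```python
from typing import Collection, Sequence


def starts_with_command(args: Sequence[str], commands: Collection[str]) -> bool:
    """Return True only if a first argument exists and names a known command."""
    # bool(args) is False for an empty list, so args[0] is never evaluated then.
    return bool(args) and args[0] in commands


print(starts_with_command([], {"upload", "download"}))
print(starts_with_command(["upload", "data.csv"], {"upload", "download"}))
```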
