meteor-operator's Introduction

Meteor Operator

Prerequisites

The cluster the operator will run on must have these components already deployed and available:

  • Tekton / Openshift Pipelines
  • cert-manager
  • A default storage class with dynamic PV provisioning

Custom Resources

Meteor

Meteor represents a repository that is built and deployed by this Operator.

apiVersion: meteor.zone/v1alpha1
kind: Meteor
metadata:
  name: demo
spec:
  url: github.com/aicoe-aiops/meteor-demo
  ref: main
  ttl: 100000 # Time to live in seconds, defaults to 24h
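
A minimal usage sketch (assuming the CRDs are installed and your current namespace is the operator's target namespace; the file name is a placeholder):

oc apply -f meteor-demo.yaml  # the manifest above, saved locally
oc get meteor demo            # watch the status as the operator creates the pipelines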

Development

General pre-requisites:

  1. A Kubernetes or OpenShift cluster, with the local KUBECONFIG configured to access it in the preferred target namespace.

    • for OpenShift: oc login, and use oc project XXX to switch to the XXX target namespace
    • for a quick local Kubernetes cluster deployment, see the kind instructions below
  2. The cluster must meet the prerequisites mentioned above (Tekton, etc.). The script script_nfs.bash can install the necessary storage class if you use Quicklab as your development environment (see the script header for requirements)

  3. To deploy the Tekton pipelines and tasks the CRE Operator depends on: make install-pipelines

Run operator locally

Interactive debugging

To debug/run the operator locally while it stays connected to a cluster and listens to that cluster's events, use these steps.

Setup

Prerequisites: dlv and VSCode

Install dlv via go install github.com/go-delve/delve/cmd/dlv@latest (on older Go toolchains: go get -u github.com/go-delve/delve/cmd/dlv)

Launch

  1. Log in to your cluster via oc login
  2. Install CRDs via make install
  3. Start a VSCode debugging session by selecting the Meteor Operator profile (a sketch of such a profile follows)
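
The profile is expected in .vscode/launch.json. A minimal sketch of such a profile — the program path and env are assumptions; ENABLE_WEBHOOKS=false mirrors the flag used for local testing below:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Meteor Operator",
      "type": "go",
      "request": "launch",
      "mode": "debug",
      "program": "${workspaceFolder}/main.go",
      "env": { "ENABLE_WEBHOOKS": "false" }
    }
  ]
}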

Quick build and deploy to a cluster

The following commands will build the image, push it to the container registry, and deploy the operator to the aicoe-meteor namespace of your currently configured cluster (as per your $KUBECONFIG):

If you want to use a custom image name/version:

export IMAGE_TAG_BASE=quay.io/myorg/meteor-operator
export VERSION=0.0.7

To build, push and deploy:

podman login ...  # to make sure you can push the image to the public registry

make docker-build
make docker-push
make install-pipelines
make deploy

Testing locally with envtest

See https://book.kubebuilder.io/reference/envtest.html for details; an extract:

export K8S_VERSION=1.21.2
curl -sSLo envtest-bins.tar.gz "https://go.kubebuilder.io/test-tools/${K8S_VERSION}/$(go env GOOS)/$(go env GOARCH)"
sudo mkdir /usr/local/kubebuilder
sudo chown $(whoami) /usr/local/kubebuilder
tar -C /usr/local/kubebuilder --strip-components=1 -zvxf envtest-bins.tar.gz
make test SKIP_FETCH_TOOLS=1 KUBEBUILDER_ASSETS=/usr/local/kubebuilder ENABLE_WEBHOOKS=false

Deploying a local cluster with kind

Using make kind-create will set up a local Kubernetes cluster for testing, using kind. make kind-load-img will build and load the operator container image into the cluster.

export T=$(kubectl -n kubernetes-dashboard create token admin-user) will get the admin user token for the Kubernetes dashboard.

Run kubectl proxy to expose port 8001 and access http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ for the Kubernetes dashboard.

Use kubectl port-forward -n tekton-pipelines service/tekton-dashboard 9097:9097 to expose the tekton dashboard, visit it at http://localhost:9097/

Deploying a local cluster with microshift

A note on MicroShift versions

At the time of this writing (2022-09-07):

  • upstream/published versions of microshift have not been updated for a while, and they are based on a relatively old version of Kubernetes, v1.21.
  • current versions of the Tekton operator require Kubernetes v1.22 or later (apparently inheriting the requirement from knative?)
  • the development version of MicroShift is based on OpenShift 4.10, which meets the version requirements:
$ oc version
Client Version: 4.10.0-202207291637.p0.ge29d58e.assembly.stream-e29d58e
Kubernetes Version: v1.23.1

Creating a MicroShift cluster

Follow the MicroShift development environment on RHEL 8 documentation to set up MicroShift on a virtual machine.

Known issues

  • As of 2022-09-07, an existing MicroShift bug assumes a pre-existing, hardcoded LVM volume group named rhel is available on the VM. One quick way to get that is to add an additional virtual disk to your VM (say, /dev/vdb) and create the volume group on top of it:
sudo lvm pvcreate /dev/vdb
sudo lvm vgcreate rhel /dev/vdb

Getting access to the MicroShift cluster from the host

Assuming that your VM is cre.example.com and that the cloud-user user already has its client configured, you just need to copy the kubeconfig file locally:

scp cloud-user@cre.example.com:.kube/config /tmp/microshift.config
sed -i -e s/127.0.0.1/cre.example.com/ /tmp/microshift.config
export KUBECONFIG=/tmp/microshift.config

Deploying the required components on MicroShift

# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.8.0/cert-manager.yaml

# Install tekton-pipelines
oc create ns tekton-pipelines
oc adm policy add-scc-to-user anyuid -z tekton-pipelines-controller
oc adm policy add-scc-to-user anyuid -z tekton-pipelines-webhook
# FIXME: ugly workaround for fixed UID and version pin due to deprecations in kube 1.25+
curl -s https://storage.googleapis.com/tekton-releases/pipeline/previous/v0.39.0/release.notags.yaml | grep -vw 65532 | oc apply -f-
# FIXME: should deploy standard ClusterTasks properly
curl -s https://api.hub.tekton.dev/v1/resource/tekton/task/openshift-client/0.2/raw | sed -e s/Task/ClusterTask/ | oc apply -f-

Testing the operator against an existing cluster

  1. make install-pipelines will deploy the Tekton pipelines' manifests
  2. make install will deploy all our CRDs
  3. make run will run the controller locally, but connected to the cluster (a quick smoke test follows).
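
To smoke-test the setup, apply one of the bundled samples and watch the resulting PipelineRun (the sample file name is taken from config/samples and may differ in your checkout):

oc apply -f config/samples/meteor_v1alpha1_customruntimeenvironment_package-list-1.yaml
tkn pipelinerun list
oc get cre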

meteor-operator's People

Contributors

codificat, goern, harshad16, kpostoffice, sesheta, tumido, vannten


meteor-operator's Issues

Handling spec changes in the CNBi crd (packages changes/other)

Problem statement

The user could edit a CNBi (for example, to add or remove packages, or to change the base image).
But we might want to keep the old image and ImageStream around (notebooks may still be running from it?)

Proposal description

Using an intermediate CR representing one version of the CNBi.
To use an analogy, this would be the same relation that exists between a Deployment and its ReplicaSet

Alternatives

  • do not handle separate versions of the same CNBi at all.
  • other ways to handle it (ideas welcome)

Additional context

We talked (a bit) about this during today's CNBi meeting.

Acceptance Criteria

  • A user can change a CNBi resource (.spec) while keeping the old version around

/wg cnbi
/priority backlog
/kind feature

sync all the pipelines

Make sure all the CNBi pipelines are consistent in how builds work.

The focus for the MVP is Python/Jupyter images.

Create Shower CRD

Add a new CR that:

  • Deploys the Shower UI
  • Is a singleton per namespace
  • All Meteors within that namespace belong to it
  • Holds default and global configs.
apiVersion: meteor.operate-first.cloud/v1alpha1
kind: Shower
metadata:
  name: default
spec:

Future work:

  1. Make Shower deploy and manage the Shower UI, with no CR config required; this will be turned on by default
  2. Extend with global config for external services that are sourced by each meteor - like which namespace JupyterHub is located in (sketched below)
  3. Extend with default configs like workspace specification
  4. Extend default configs with optional specifications like push-to-quay secrets etc.
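
A sketch of what the extended spec could look like — all fields are hypothetical; externalServices mirrors the field referenced by the "Reconcile Comas dynamically" issue below:

apiVersion: meteor.operate-first.cloud/v1alpha1
kind: Shower
metadata:
  name: default
spec:
  externalServices:       # hypothetical: global config for services each Meteor sources
    - name: jupyterhub
      namespace: opf-jupyterhub
  workspace: {}           # hypothetical: default workspace specification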

Current pipeline definitions do not match the controller parameters

Bug description

The current Pipeline manifests in pipelines/ do not work with the CNBi controller because their parameter names differ

Steps to Reproduce

Steps to reproduce the behavior:

  1. Deploy the current controller
  2. Create a CNBi resource, e.g. config/samples/meteor_v1alpha1_customnbimage_image-import.yaml
  3. Check the PipelineRun that the controller creates for the CNBi resource
  4. See error

Actual behavior

$ tkn pr describe -L
Name:              cnbi-s2i-minimal-py38-notebook-import-import
Namespace:         aicoe-meteor
Pipeline Ref:      cnbi-import
Service Account:   pipeline
Timeout:           1h0m0s
Labels:
 app.kubernetes.io/created-by=byon
 tekton.dev/pipeline=cnbi-import

๐ŸŒก๏ธ  Status

STARTED          DURATION    STATUS
25 seconds ago   0 seconds   Failed(ParameterMissing)

๐Ÿ’Œ Message

PipelineRun aicoe-meteor parameters is missing some parameters required by Pipeline cnbi-s2i-minimal-py38-notebook-import-import's parameters: PipelineRun missing parameters: [url desc]

โš“ Params

 NAME            VALUE
 โˆ™ name          s2i-minimal-py38-notebook
 โˆ™ creator       goern
 โˆ™ description   minimal notebook image for python 3.8
 โˆ™ baseImage     quay.io/thoth-station/s2i-minimal-py38-notebook:v0.2.2

๐Ÿ“‚ Workspaces

 NAME           SUB PATH   WORKSPACE BINDING
 โˆ™ data         ---        VolumeClaimTemplate
 โˆ™ sslcertdir   ---        ConfigMap (config=openshift-service-ca.crt,item=service-ca.crt=ca.crt)

Expected behavior

PipelineRun can start.

Environment information

Additional context

The pipeline definitions previously under hack/ (removed by #143 ) had their parameters modified to match the controller's expected parameter names.

Reconcile Comas dynamically

Currently a list of namespaces is hardcoded to ["opf-jupyterhub"] here:

https://github.com/AICoE/meteor-operator/blob/13283691e4e86e85db24383b4ded4922fcf82915/controllers/meteor/coma_reconciler.go#L15

Leverage the reconciler property r.Shower to load it dynamically from the externalServices spec.

https://github.com/AICoE/meteor-operator/blob/13283691e4e86e85db24383b4ded4922fcf82915/controllers/meteor/meteor_controller.go#L48
https://github.com/AICoE/meteor-operator/blob/13283691e4e86e85db24383b4ded4922fcf82915/api/v1alpha1/shower_types.go#L46

The end goal is to populate the namespaces variable with all the namespaces available in externalServices[*].namespace.
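
A sketch of the dynamic lookup (field names are assumptions based on the shower_types.go reference above):

// Collect target namespaces from the Shower's externalServices spec
namespaces := []string{}
for _, svc := range r.Shower.Spec.ExternalServices {
    if svc.Namespace != "" {
        namespaces = append(namespaces, svc.Namespace)
    }
}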

make each pipeline run name unique

The PipelineRuns triggered by the CNBi controller are named after their corresponding cnbi object name and buildType.

Two PipelineRuns from the same object would have a name clash.

Each PipelineRun should have a unique name to prevent such a clash.

Acceptance criteria

  • each pipeline run created from the meteor operator has a unique name (possibly based on the build type, but with extra unique ids)
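
One possible approach is metav1.ObjectMeta's GenerateName, which asks the API server to append a unique suffix (a sketch, not the current implementation):

pipelineRun := &pipelinev1beta1.PipelineRun{
    ObjectMeta: metav1.ObjectMeta{
        // the API server appends a random suffix, e.g. "cnbi-demo-packagelist-x7k2p"
        GenerateName: fmt.Sprintf("%s-%s-", cnbi.Name, strings.ToLower(string(cnbi.Spec.BuildType))),
        Namespace:    cnbi.Namespace,
    },
}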

persistent meteors

As a User of the Meteor Web Site,
I want to check a box,
so that a deployed meteor is not garbage collected
and so that the URLs can be shared

Acceptance Criteria

  • checkbox to request "persistent" is visible on UI

Some alternatives:

  • make the timeout configurable (24hr, 1w, never)
  • add CNAME routing for nice URLs
  • add the deployed meteor to a catalog.meteor.zone website
  • make the timeout and URL changeable after the meteor is deployed

add odh label

As Gage pointed out in chat, we need to add another label 'opendatahub.io/notebook-name: true' to the imagestream created by CRE
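
For reference, the requested label on the generated ImageStream would look like this (quoted as stated in the chat; note that label values must be strings):

kind: ImageStream
metadata:
  labels:
    opendatahub.io/notebook-name: "true"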

/kind feature
/wg cre
/priority critical-urgent

Collect results from Tekton pipelines

Each pipeline should produce a link/target that is accessible by the user. The operator should collect those results, and Shower should be able to showcase them.

Move ImageStream creation/handling from Pipelines to Operator

High-level Goals

  • Directly handle ImageStreams inside the Operator
  • No interaction between the Pipelines and the OpenShift API
  • Optionally: use the ImageStream to determine the CNBi state (ImageAvailable)

Proposal description

The ImageStream is created as part of the reconciling loop and used as a reconcile trigger.

Alternatives

Keep the current situation.

Additional context

This was mentioned during the CNBi meeting (9 Nov 2022).
Note that using the ImageStream as the "implementer" of the CNBi rather than the Pipeline might make more sense.

Consider the following scenarios:

  • CNBi is created => Pipeline gets run => ImageStream created && image pushed.
    => What if the ImageStream is deleted (by the user)? In that case, a K8S operator should reconcile, and that means recreating the ImageStream
    => What if the underlying image is removed from the internal registry? In that case, we'd need a new pipeline to rebuild the image
    Note that neither of these scenarios implies a change to the PipelineRun resource, and neither will (currently) trigger a reconcile request in the CNBi controller.

More theoretically, we should care about the result (= the ImageStream), not the process (= the PipelineRun), when doing level-based reconciliation (in the words of the k8s docs).
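
A sketch of watching the result with controller-runtime (assuming the CNBi controller owns the ImageStream; imagev1 is github.com/openshift/api/image/v1):

// Reconcile on changes to owned ImageStreams, not only PipelineRuns
ctrl.NewControllerManagedBy(mgr).
    For(&meteorv1alpha1.CustomNBImage{}).
    Owns(&imagev1.ImageStream{}).
    Complete(r)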

Acceptance Criteria

  • The CNBi controller creates ImageStreams with proper OwnerRefs
  • (Optional, might be in another issue) Reconcile using the ImageStream state

gitRepo build pipeline is not working

Bug description

The current version of the cre-gitrepo pipeline (that implements the GitRepository build type) is not working.

Steps to Reproduce

Steps to reproduce the behavior:

  1. Create a CustomRuntimeEnvironment of type GitRepository (for example, the one in config/samples)
  2. Check the related PipelineRun
  3. See error

Actual behavior

$ tkn pr describe -L
Name:              cre-elyra-aidevsecops-tutorial-gitrepo
Namespace:         aicoe-meteor
Pipeline Ref:      cre-gitrepo
Service Account:   pipeline
Timeout:           1h0m0s
Labels:
 cre.thoth-station.ninja/pipeline=gitrepo
 tekton.dev/pipeline=cre-gitrepo

๐ŸŒก๏ธ  Status

STARTED         DURATION    STATUS
5 seconds ago   0 seconds   Failed(InvalidTaskResultReference)

๐Ÿ’Œ Message

invalid pipeline result "url": referenced pipeline task "url-to-results" does not exist

โš“ Params

 NAME            VALUE
 โˆ™ name          Elyra DevSecOps Tutorial
 โˆ™ creator       codificat
 โˆ™ description   Build from the Elyra Tutorial
 โˆ™ url           https://github.com/AICoE/elyra-aidevsecops-tutorial
 โˆ™ ref           master

๐Ÿ“‚ Workspaces

 NAME           SUB PATH   WORKSPACE BINDING
 โˆ™ data         ---        VolumeClaimTemplate
 โˆ™ sslcertdir   ---        ConfigMap (config=openshift-service-ca.crt,item=service-ca.crt=ca.crt)

Expected behavior

The PipelineRun created for the git repo CRE works.

Environment information

meteor-operator v0.2.1

Additional context

Noticed this while reviewing thoth-station/odh-dashboard#14

This is not a use case in the initial CNBi functionality, and therefore it's optional in that sense for the initial version.

PackageList build reports as Failed after the pipeline succeeds

I created a Custom Runtime Environment of type PackageList. The associated pipeline run completed successfully, but at the end the CRE resource switched to phase Failed instead of Succeeded:

$ oc get cre
NAME                PHASE       AGE
cre-1670932884507   Failed      6h36m

$ oc get pr
NAME                                 SUCCEEDED   REASON      STARTTIME   COMPLETIONTIME
cre-cre-1670932884507-package-list   True        Succeeded   6h36m       6h34m

This is the CRE:

apiVersion: meteor.zone/v1alpha1
kind: CustomRuntimeEnvironment
metadata:
  annotations:
    opendatahub.io/notebook-image-creator: admin
    opendatahub.io/notebook-image-desc: mimic config/samples/meteor_v1alpha1_customruntimeenvironment_package-list-1.yaml
    opendatahub.io/notebook-image-name: minimal pandas
  creationTimestamp: "2022-12-13T12:01:24Z"
  generation: 1
  labels:
    app.kubernetes.io/part-of: meteor-operator
  name: cre-1670932884507
  namespace: aicoe-meteor
  resourceVersion: "802798"
  uid: eeee224a-af0a-466b-bebc-8fee1eed3b97
spec:
  baseImage: image-registry.openshift-image-registry.svc:5000/aicoe-meteor/s2i-minimal-notebook:v0.3.0-py38
  buildType: PackageList
  imagePullSecret:
    name: ""
  packageVersions:
  - pandas
  - boto3>=1.24.0
  runtimeEnvironment: {}
status:
  conditions:
  - lastTransitionTime: "2022-12-13T12:03:59Z"
    message: The PipelineRun has been completed successfully.
    observedGeneration: 1
    reason: PipelineRunCompleted
    status: "True"
    type: PipelineRunCompleted
  observedGeneration: 1
  phase: Failed
  pipelines:
  - name: cre-1670932884507
    pipelineRunName: cre-cre-1670932884507-package-list
    ready: "False"

[ImportImage] implement check for required secret

There is no check yet for the "import from private repository" use case. We could do that before creating the PipelineRun: it should set the RequiredSecretMissing condition if the referenced secret does not exist. If a subsequent reconciliation loop is able to get the secret, the condition could be removed and the PipelineRun created.
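
A sketch of such a check in the reconcile loop (hedged: the condition type comes from this issue, the imagePullSecret field from the CRE sample above):

secret := &corev1.Secret{}
err := r.Get(ctx, types.NamespacedName{
    Name:      cnbi.Spec.ImagePullSecret.Name,
    Namespace: cnbi.Namespace,
}, secret)
if apierrors.IsNotFound(err) {
    meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
        Type:    "RequiredSecretMissing",
        Status:  metav1.ConditionTrue,
        Reason:  "SecretNotFound",
        Message: fmt.Sprintf("secret %q not found", cnbi.Spec.ImagePullSecret.Name),
    })
    // retry later; create the PipelineRun once the secret appears
    return ctrl.Result{RequeueAfter: time.Minute}, nil
}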

checkbox to create PR to repository about Thoth advice

As a User of the Meteor Web Site,
I want to check a box,
so that a Pull Request is created on the repository I submitted,
and so that the PR gives interesting insights on my project.

Acceptance Criteria

  • checkbox to request "create Thoth advise" is visible on UI
  • tekton task is triggered to create thoth PR
  • PR is created on user's repository
  • the interface between the operator and features is a tekton pipeline

in favor of AICoE/meteor#92

Consolidate metrics into single push step

It's not DRY for each reporting step to contain a Prometheus client and push metrics to the push gateway by itself. It clutters the code and can be simplified:

  • Define all metrics in a single task
  • Other tasks can populate metric values as files shared on a workspace
  • A single task in finally (guaranteed to run whether the pipeline fails or succeeds) pushes the metrics to the push gateway (see the sketch after the benefits list)

Benefit:

  • all custom metrics emitted from a pipeline are consolidated in one step, hence one source of truth
  • metrics are not duplicated
  • prometheus client is required in a single step (less clutter in thoth reporting etc)
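
A sketch of the finally-based layout (task and workspace names are placeholders):

kind: Pipeline
spec:
  workspaces:
    - name: data
  tasks:
    - name: report-values        # placeholder: writes metric values as files on the workspace
      taskRef:
        name: report-values      # placeholder
      workspaces:
        - name: data
          workspace: data
  finally:
    - name: push-metrics         # runs whether the pipeline succeeds or fails
      taskRef:
        name: push-metrics       # placeholder: the single task holding the prometheus client
      workspaces:
        - name: data
          workspace: data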

make all the ImageStreams have a proper owner ref to their CNBi so they get cleaned up

When a cnbi object is created and eventually an ImageStream gets created for it, the ImageStream should have a proper ownerReference that relates it to its corresponding cnbi object.

With that in place, deleting a CustomNBImage object would also result in the cleanup of its associated ImageStream (see the sketch below).

Acceptance criteria

  • all pipelines create imageStreams with a proper ownerRef
  • the controller passes the necessary parameters to the PipelineRun
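
For illustration, the ownerReference on the generated ImageStream would look roughly like this (names are placeholders):

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: my-cnbi-image            # placeholder
  ownerReferences:
    - apiVersion: meteor.zone/v1alpha1
      kind: CustomNBImage
      name: my-cnbi              # placeholder: the owning CNBi object
      uid: <uid of the CNBi object>
      controller: true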

`make test` fails locally (this might be non-reproducible)

Bug description

$ make test
[...]
[BeforeSuite] [FAILED] [15.021 seconds]
[BeforeSuite] 
/home/goern/Work/thoth-station/meteor-operator/api/v1alpha1/webhook_suite_test.go:62

  Begin Captured GinkgoWriter Output >>
    STEP: bootstrapping test environment 09/30/22 10:38:46.69
    1.6645271266909258e+09      DEBUG   controller-runtime.test-env     starting control plane
    1.6645271315763607e+09      DEBUG   controller-runtime.test-env     reading Webhooks from path      {"path": "../../config/webhook"}
    1.6645271315767803e+09      DEBUG   controller-runtime.test-env     read webhooks from file {"file": "kustomization.yaml"}
    1.6645271315771356e+09      DEBUG   controller-runtime.test-env     read webhooks from file {"file": "kustomizeconfig.yaml"}
    1.664527131580171e+09       DEBUG   controller-runtime.test-env     read webhooks from file {"file": "manifests.yaml"}
    1.6645271315803628e+09      DEBUG   controller-runtime.test-env     read webhooks from file {"file": "service.yaml"}
    1.66452713158077e+09        DEBUG   controller-runtime.test-env     installing CRDs
    1.6645271315808141e+09      DEBUG   controller-runtime.test-env     reading CRDs from path  {"path": "../../config/crd/bases"}
    1.6645271315820005e+09      DEBUG   controller-runtime.test-env     read CRDs from file     {"file": "meteor.zone_comas.yaml"}
    1.6645271315836825e+09      DEBUG   controller-runtime.test-env     read CRDs from file     {"file": "meteor.zone_customnbimages.yaml"}
    1.6645271315849469e+09      DEBUG   controller-runtime.test-env     read CRDs from file     {"file": "meteor.zone_meteorconfigs.yaml"}
    1.6645271315866168e+09      DEBUG   controller-runtime.test-env     read CRDs from file     {"file": "meteor.zone_meteors.yaml"}
    1.6645271315879364e+09      DEBUG   controller-runtime.test-env     read CRDs from file     {"file": "meteor.zone_services.yaml"}
    1.6645271315912976e+09      DEBUG   controller-runtime.test-env     read CRDs from file     {"file": "meteor.zone_showers.yaml"}
    1.6645271316106536e+09      DEBUG   controller-runtime.test-env     installing CRD  {"crd": "comas.meteor.zone"}
    1.6645271316206696e+09      DEBUG   controller-runtime.test-env     installing CRD  {"crd": "customnbimages.meteor.zone"}
    1.664527131629209e+09       DEBUG   controller-runtime.test-env     installing CRD  {"crd": "meteorconfigs.meteor.zone"}
    1.6645271316373017e+09      DEBUG   controller-runtime.test-env     installing CRD  {"crd": "meteors.meteor.zone"}
    1.6645271316711001e+09      DEBUG   controller-runtime.test-env     installing CRD  {"crd": "services.meteor.zone"}
    1.6645271316926992e+09      DEBUG   controller-runtime.test-env     installing CRD  {"crd": "showers.meteor.zone"}
    1.6645271317074015e+09      DEBUG   controller-runtime.test-env     adding API in waitlist  {"GV": "meteor.zone/v1alpha1"}
    1.664527131707455e+09       DEBUG   controller-runtime.test-env     adding API in waitlist  {"GV": "meteor.zone/v1alpha1"}
    1.6645271317074676e+09      DEBUG   controller-runtime.test-env     adding API in waitlist  {"GV": "meteor.zone/v1alpha1"}
    1.664527131707476e+09       DEBUG   controller-runtime.test-env     adding API in waitlist  {"GV": "meteor.zone/v1alpha1"}
    1.6645271317074847e+09      DEBUG   controller-runtime.test-env     adding API in waitlist  {"GV": "meteor.zone/v1alpha1"}
    1.6645271317074955e+09      DEBUG   controller-runtime.test-env     adding API in waitlist  {"GV": "meteor.zone/v1alpha1"}
  << End Captured GinkgoWriter Output

  Unexpected error:
      <*fmt.wrapError | 0xc0002b4a00>: {
          msg: "unable to install CRDs onto control plane: something went wrong waiting for CRDs to appear as API resources: timed out waiting for the condition",
          err: <*fmt.wrapError | 0xc0002b49c0>{
              msg: "something went wrong waiting for CRDs to appear as API resources: timed out waiting for the condition",
              err: <*errors.errorString | 0xc0002d7190>{
                  s: "timed out waiting for the condition",
              },
          },
      }
      unable to install CRDs onto control plane: something went wrong waiting for CRDs to appear as API resources: timed out waiting for the condition
  occurred
  In [BeforeSuite] at: /home/goern/Work/thoth-station/meteor-operator/api/v1alpha1/webhook_suite_test.go:79

[...]

Steps to Reproduce

Steps to reproduce the behavior:

  1. gh repo clone this repo and cd into it
  2. dnf install golang
  3. make test
  4. See error

Actual behavior

see above

Expected behavior

tests succeed

Environment information

n/a

Additional context

n/a

ImageStream annotations disappear if an import pipeline fails

Bug description

When testing an Image Import CRE for an image that does not meet jupyterlab requirements, the pipeline's validation task correctly identifies the problem, but the update it does to the ImageStream results in annotations disappearing.

Steps to Reproduce

Steps to reproduce the behavior:

  1. create a CRE to import this image: quay.io/thoth-station/s2i-minimal-notebook:v0.0.15. The image does not meet the requirement of python >= 3.8
  2. watch the corresponding pipelinerun

Actual behavior

At the beginning of the pipeline, an ImageStream gets created that includes the annotations: opendatahub.io/notebook-image-{name,desc,creator,...}.

Once the pipeline completes, the ImageStream only has these 2 annotations:

  • opendatahub.io/notebook-image-messages=[... Python version does not meet minimal required version ...]
  • opendatahub.io/notebook-image-phase=Failed

Expected behavior

The annotations that contain the name, description etc are present in the generated ImageStream

Environment information

meteor-operator v0.2.1

Additional context

This causes the CRE to lose its name in the odh-dashboard.

Have the CNBi operator trigger the PackageList pipeline when a CR of that type is created

One of the CustomNBImage's buildTypes is PackageList (example).

The CNBi operator needs to trigger the right pipeline for this build type when a CR of that type is created.

Acceptance criteria

  • PipelineRun(s) for the correct Pipeline(s) are created when a CNBi CR with buildType: PackageList is created
  • the CNBi CR status is correctly updated throughout the execution of the pipelines
  • errors are being properly handled

Additional information

The pipeline that implements the logic for that build type is being developed in thoth-station/helm-charts#50

review deployment of pipelines to be backward compatible

Problem statement

The pipelines of the meteor-operator currently deployed on smaug need updates. This should be done by deploying a new version of meteor-operator >v0.2.0

might depend on #108

High-level Goals

update existing deployments of pipelines/tasks/...

Proposal description

n/a

Alternatives

none

Additional context

operate-first/hitchhikers-guide#130

Acceptance Criteria

/kind feature
/wg cnbi
/wg cre
/sig devsecops
/priority important-soon

Inconsistent phase between CRE and ImageStream

Bug description

The Phase of a custom environment becomes inconsistent between a CustomRuntimeEnvironment and its associated ImageStream.

Steps to Reproduce

Steps to reproduce the behavior:

  1. Create a CRE of type import
  2. Set the image to import to quay.io/thoth-station/elyra-aidevsecops-training:v0.14.1. That image does not meet the requirements
  3. Let the associated PipelineRun complete

Actual behavior

$ oc get cre cre-1673945925026 -o jsonpath='{.status.phase}'
Succeeded

$ oc describe is cre-cre-1673945925026-import | grep phase=
			opendatahub.io/notebook-image-phase=Failed

Expected behavior

The phase information between CRE and ImageStream should match.

Environment information

meteor-operator v0.2.1

Additional context

Logs for the pipelinerun for the example image import above, where the import is marked as failed because it doesn't meet the expected requirements:

[validate : get-package-versions] Analyzing image:
[validate : get-package-versions]     Python:             3.8.6
[validate : get-package-versions]     Python packages:    boto3==1.18.42 matplotlib==3.4.3 tensorflow==2.6.0 thamos==1.18.1
[validate : get-package-versions]     R (optional):       Not available
[validate : get-package-versions]     RStudio (optional): Not available

[validate : minimal-requirements] Validating minimal requirements:
[validate : minimal-requirements]     Python >= 3.8.0 โœ…
[validate : minimal-requirements]     'jupyterhub' package is present โŒ
[validate : minimal-requirements]     'jupyterlab' package is present โŒ
[validate : minimal-requirements]     $HOME is writeable โœ…
[validate : minimal-requirements]     start-singleuser.sh is present  โŒ
[validate : minimal-requirements]     start-singleuser.sh must execute 'jupyter'  โŒ
[validate : minimal-requirements]     start-singleuser.sh must start 'labhub' environment  โš 
[validate : minimal-requirements]     start-singleuser.sh must accept runtime args or pass a config file  โš 

[update-imagestream : oc] Updating image stream:
[update-imagestream : oc]   Name:       cre-cre-1673945925026-import
[update-imagestream : oc]   Namespace:  aicoe-meteor
[update-imagestream : oc] 
[update-imagestream : oc]   Image tag:           v0.14.1
[update-imagestream : oc]   Phase:               Failed
[update-imagestream : oc]   Messages:            [{"severity":"error","message":"Missing 'jupyterhub' python package"},{"severity":"error","message":"Missing 'jupyterlab' python package"},{"severity":"error","message":"start-singleuser.sh script must be present in /opt/app-root/bin:/opt/app-root/src/.local/bin/:/opt/app-root/src/bin:/opt/app-root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin and executable"},{"severity":"error","message":"start-singleuser.sh script took too long to call 'jupyter labhub'"}]
[update-imagestream : oc]   Visibility:          false
[update-imagestream : oc]   Software:            [{"name":"Python","version":"3.8.6"}]
[update-imagestream : oc]   Python dependencies: [{"name": "boto3", "version": "1.18.42", "visible": true}, {"name": "matplotlib", "version": "3.4.3", "visible": true}, {"name": "tensorflow", "version": "2.6.0", "visible": true}, {"name": "thamos", "version": "1.18.1", "visible": true}]
[update-imagestream : oc] 
[update-imagestream : oc] imagestream.image.openshift.io/cre-cre-1673945925026-import patched
[update-imagestream : oc] imagestream.image.openshift.io/cre-cre-1673945925026-import annotated

CNBi conditions are not declarative

Problem statement

Many of the conditions defined in api/v1alpha/cnbi_conditions_consts.go describe state changes instead of state.
This goes against the recommendations in the kubernetes API conventions relative to conditions, particularly:

Condition type names should describe the current observed state of the resource, rather than describing the current state transitions.

Take a look at controllers/cnbi/controller.go:

meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
    Type:    meteorv1alpha1.ErrorPipelineRunCreate,
    Status:  metav1.ConditionTrue,
    Reason:  "PipelineRunCreateFailed",
    Message: err.Error(),
})
...
meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
    Type:    meteorv1alpha1.PipelineRunCreated,
    Status:  metav1.ConditionTrue,
    Reason:  "PipelineRunCreated",
    Message: fmt.Sprintf("%s PipelineRun created successfully", pipelineRun.Name),
...
meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
    Type:    meteorv1alpha1.GenericPipelineError,
    Status:  metav1.ConditionTrue,
    Reason:  "PipelineRunGenericError",
    Message: err.Error()},
)
...
meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
    Type:    meteorv1alpha1.PipelineRunCompleted,
    Status:  metav1.ConditionTrue,
    Reason:  "PipelineRunCompleted",
    Message: "The PipelineRun has been completed successfully.",
})
...
meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
    Type:    meteorv1alpha1.ImageImportReady,
    Status:  metav1.ConditionTrue,
    Reason:  "ImageImportReady",
    Message: "Import succeeded, the image is ready to be used.",
})
...
meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
    Type:    meteorv1alpha1.PipelineRunCompleted,
    Status:  metav1.ConditionTrue,
    Reason:  "PipelineRunCompleted",
    Message: "The PipelineRun has been completed with a failure!",
})
...
meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
    Type:    meteorv1alpha1.ImageImportReady,
    Status:  metav1.ConditionFalse,
    Reason:  "ImageImportNotReady",
    Message: "Import failed, this could be due to the repository to import from does not exist or is not accessible",
...
meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
    Type:    meteorv1alpha1.ErrorBuildingImage,
    Status:  metav1.ConditionTrue,
    Reason:  "ErrorBuildingImage",
    Message: "Build failed!",

(this is the code in #129, but it only changes the syntax, not the conditions name or the reasons)
For nearly all the conditions, we use a Reason that is the same as the condition type; I think our Reasons are fine (they describe state transitions), but our condition types are not.
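
For contrast, a declarative version of the same update could look like this (a sketch; the condition type is taken from the proposal below):

meta.SetStatusCondition(&cnbi.Status.Conditions, metav1.Condition{
    Type:    "ImageAvailable",      // observed state, not a transition
    Status:  metav1.ConditionFalse, // not available yet
    Reason:  "PipelineRunCreated",  // the transition that explains the state
    Message: "Image build in progress",
})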

High-level Goals

Be more Kubernetes-like when implementing the Conditions.

Proposal description

List of possible alternative conditions:

BaseImageAvailable
ImageAvailable
ImageIsValid
ThothAdviseScheduled (see K8s API conventions on long state transitions)
ThothAdviseAvailable

Additional context

#126 is relevant; tackling the two together might be a good idea.
Consider in particular the following scenario:

  1. A CNBi has been created, the ImageStream has been created and the image built and pushed.
  2. The ImageStream gets deleted. We should propagate that change, and none of the existing conditions fits (we should also re-launch a pipeline to reconcile the state).

Acceptance Criteria

  • The CNBi controller uses declarative conditions, with Reason used as a state-transition indicator

References

/wg cnbi
/priority important-longterm
/group-programming

example yaml is invalid, but tests successful

as of #71 the tests are successful, for example, https://github.com/AICoE/meteor-operator/blob/65f8ee5b6c323c5c6e8246df27de4fbe7b5ddda9/controllers/customnbimage_controller_test.go#L49-L63 but the examples cannot be applied:

$ kubectl apply -f config/samples/meteor_v1alpha1_customnbimage.yaml
Error from server (Invalid): error when creating "config/samples/meteor_v1alpha1_customnbimage.yaml": CustomNBImage.meteor.zone "s2i-minimal-py38-sample-1" is invalid: spec.type: Required value
Error from server (Invalid): error when creating "config/samples/meteor_v1alpha1_customnbimage.yaml": CustomNBImage.meteor.zone "ubi8-py38-sample-1" is invalid: spec.type: Required value
Error from server (Invalid): error when creating "config/samples/meteor_v1alpha1_customnbimage.yaml": CustomNBImage.meteor.zone "ubi8-py38-sample-2" is invalid: spec.type: Required value
Error from server (Invalid): error when creating "config/samples/meteor_v1alpha1_customnbimage.yaml": CustomNBImage.meteor.zone "s2i-minimal-py38-notebook-import" is invalid: spec.type: Required value
Error from server (Invalid): error when creating "config/samples/meteor_v1alpha1_customnbimage.yaml": CustomNBImage.meteor.zone "elyra-aidevsecops-tutorial" is invalid: spec.type: Required value

It feels like™ it is related to the inlined struct at https://github.com/AICoE/meteor-operator/blob/65f8ee5b6c323c5c6e8246df27de4fbe7b5ddda9/api/v1alpha1/customnbimage_types.go#L114

/kind bug
/priority critical-urgent
blocks #71

Enable pipeline discovery through operator

Operator should discover all available meteor pipelines and sync them to the Shower resource status.

The operator should discover all Tekton Pipeline resources with the given annotations and sync them into a new status property of the Shower resource.

It expects the following annotations on the Pipeline resource:

kind: Pipeline
metadata:
  annotations:
    meteor.zone/pipeline: "true"
    shower.meteor.zone/default-state: "true"  # optional annotation, assume enabled if not set
spec:
  description: "foo bar baz"

And syncs them into the Shower status:

kind: Shower
status:
  pipelines:
    - name: <name>
      defaultState: "true"
      description: "foo bar baz"

  1. Create a new status field in ShowerStatus implemented as an array of PipelineSpec with mandatory fields name (string) and enabled (bool)
  2. Add a function which discovers all Pipeline resources matching the given annotation selector (a sketch follows the list)
  3. Populate the status field
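
A sketch of the discovery step (type and field names are assumptions):

// List all Pipelines and keep those opted in via the meteor.zone/pipeline annotation
var pipelines pipelinev1beta1.PipelineList
if err := r.List(ctx, &pipelines, client.InNamespace(shower.Namespace)); err != nil {
    return err
}
discovered := []meteorv1alpha1.PipelineSpec{}
for _, p := range pipelines.Items {
    if p.Annotations["meteor.zone/pipeline"] != "true" {
        continue
    }
    // optional annotation, assume enabled if not set
    enabled := p.Annotations["shower.meteor.zone/default-state"] != "false"
    discovered = append(discovered, meteorv1alpha1.PipelineSpec{Name: p.Name, Enabled: enabled})
}
shower.Status.Pipelines = discovered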

Make local dev work with openshift and kubernetes 1.25

Some of the functionality (e.g. manipulating ImageStreams for JupyterHub) requires OpenShift. The current dev instructions in the cnbi spike branch cover kind as a dev platform, which is not enough for this.

Furthermore, I just did a quick test of make kind-start using the latest version of kind (0.15.0, which brings in Kubernetes 1.25) and the output suggests there might be some issues with Tekton:

$ make kind-start
enabling experimental podman provider
No kind clusters found.
Creating Cluster
kind create cluster --name "meteor-cnbi" --config hack/kind-config.yaml
enabling experimental podman provider
Creating cluster "meteor-cnbi" ...
 ✓ Ensuring node image (kindest/node:v1.25.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-meteor-cnbi"
You can now use your cluster with:

kubectl cluster-info --context kind-meteor-cnbi

Have a nice day! 👋
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.8.0/cert-manager.yaml
namespace/cert-manager created
[snip multiple resources created]
service/tekton-pipelines-webhook created
resource mapping not found for name: "tekton-pipelines" namespace: "" from "https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1"
ensure CRDs are installed first
resource mapping not found for name: "tekton-pipelines-webhook" namespace: "tekton-pipelines" from "https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml": no matches for kind "HorizontalPodAutoscaler" in version "autoscaling/v2beta1"
ensure CRDs are installed first
make: *** [Makefile:247: kind-start] Error 1

Create MeteorComa CRD: Allow garbage collection of resources across namespaces

The set of external namespaces will be hardcoded for now (aicoe-meteor vs opf-jupyterhub).

Owner references can't be cross-namespace in k8s, hence we need a shadow MeteorComa resource that will be deployed to those other namespaces, tying all resources for the original Meteor in that external namespace to itself.

Required for proper GC on Meteor deletion.

Relations:

  • Meteor owns Pipelineruns
  • Meteor owns all resources deployed by the pipeline (see #10 )
  • Pipelinerun owns all tekton related content
  • MeteorComa owns all tekton deployed content in external namespace
  • Meteor has a finalizer awaiting deletion of MeteorComas which belong to this Meteor.

create CRD on pipeline location

Problem statement

There are multiple ways to maintain the pipeline/task declarations required by the operator; let's document where we put them and why

High-level Goals

CRD documenting our arch decision

Proposal description

  1. keep the pipeline/task declaration yamls in the operator at config/tekton so that they are deployed via kustomize during make deploy

Alternatives

  1. keep them in a separate repo or dir with kustomize being able to deploy them
  2. put them in a helm chart

Additional context

n/a

Acceptance Criteria

  • CRD committed to meteor-operator repo

Improvements to the cnbi-package-list pipeline

Problem statement

The new Pipeline to implement the package list build type is missing a couple of features:

  • annotations for the content, so that ODH can show which software and packages are part of the built image
  • an owner reference for the created ImageStream

High-level Goals

Improve the ImageStream and ImageStreamTag created by the package list pipeline.

Proposal description

  • Add annotations
  • Add owner reference

Acceptance Criteria

  • the ImageStream created by the package list pipeline contains an owner reference pointing at the PipelineRun that created it, so that deleting the PR also deletes the ImageStream
  • the ImageStreamTag contains annotations that allow the ODH dashboard to show the relevant software and packages contained in the image

Optionally / additionally:

  • there is a sample PipelineRun in the repository that allows standalone testing of the pipeline

Allow selective on/off of individual pipelines in Shower config

Follow up after #52

Enhance the Shower spec from previous issue with:

kind: Shower
spec:
  pipelines:
    - name: <name>
      defaultState: "true"  # Has precedence over the Pipeline resource annotation
      enabled: "true"  # If false, this pipeline can't be submitted even if selected in the Meteor spec
status:
  pipelines:
    - name: <name>
      defaultState: "true"
      enabled: "false"

Fix scaling issues with buildah

Buildah requires quite a lot of resources to complete a build. All Tekton tasks without specifically set resource requirements default to the namespace's default LimitRange.

  1. Specify larger requests for the build and push steps (see the sketch below)
  2. Specify as low as possible requests/limits for the other steps to make scheduling faster.
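
A sketch of step-level resource requests on the build Task (values are illustrative; Tekton v1beta1 steps accept a standard Kubernetes resources block):

kind: Task
spec:
  steps:
    - name: build
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          memory: 4Gi
    - name: report
      resources:
        requests:          # keep non-build steps tiny so scheduling is fast
          cpu: 50m
          memory: 64Mi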
