operate-first / apps
Operate-first application manifests
License: GNU General Public License v3.0
Currently we don't have another repo to store non-odh related apps. In the broader scheme of things, from the perspective of operate-first, odh is just another app, and I don't think we really need to have this entire repo be dedicated to odh only.
I propose we change this repo to be more generic, rename it to apps, and move the ODH contents into a subdirectory.
I'm not convinced that starting with a single namespace for all components makes sense. Per some discussions offline, it makes more sense to deploy groups of components per namespace, e.g. JupyterHub + Spark in one namespace, Superset + Data Catalog (once it's in) in another. This way we can separate concerns per namespace, gain better control over quotas for namespaces/applications, and simplify management of ODH (split logs across namespaces).
We have set up Grafana, but we need to add a datasource (example resource) so that Grafana is connected to the monitoring Prometheus instance.
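With the Grafana operator, the datasource could be a GrafanaDataSource CR roughly like this (a sketch; the Prometheus service URL is an assumption and depends on where the monitoring instance actually runs):

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus
spec:
  name: prometheus.yaml
  datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      # Assumed in-cluster service exposed by the Prometheus Operator;
      # adjust to the real monitoring Prometheus service/namespace.
      url: http://prometheus-operated:9090
      isDefault: true
```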
ODH was added in MOC CNV:
https://gitlab.com/open-infrastructure-labs/moc-cnv-sandbox/-/merge_requests/8
First confirm with ODH (likely via an upstream issue) whether there are plans to add such service monitors and the ETA. Based on this feedback, we should discuss here whether we need to proceed with adding our own and sending them upstream.
[anishasthana] At a minimum, we should update/create monitoring for the following:
Ideally via OpenShift
We need to create MOC RoleBindings for operate-first declaratively for each component/namespace; they should go here:
https://github.com/operate-first/apps/tree/master/odh/overlays/moc
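A declarative RoleBinding per component namespace could look like this (a sketch; the group name and namespace are placeholders, not the actual MOC values):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: operate-first-admins
  namespace: opf-jupyterhub       # hypothetical component namespace
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: operate-first           # hypothetical group name
roleRef:
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io
  name: admin
```

One such file would live in each component's MOC overlay, so the bindings are versioned alongside the manifests they grant access to.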
@HumairAK @tumido can we easily add https://github.com/opendatahub-io/odh-dashboard#open-data-hub-dashboard ?
It'll be included in the 0.9 release of ODH, but having it now would ease user interaction
Since the odh-dashboard requires a single component instance installed via kfdef to pick up its route, it would be ideal if we could have only one ODH-deployed Prometheus instance via one ODH-deployed Prometheus operator.
This issue is to test whether we can have service/pod monitors all deployed in a single namespace like opf-monitoring, then give RBAC to a single Prometheus instance over other namespaces like opf-jupyterhub and opf-argo, and monitor them from a single source.
In the case of Quicklab or general OpenShift:
The purpose of this issue is to see if this is possible.
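A sketch of the cross-namespace RBAC this would need, assuming the Prometheus in opf-monitoring runs under a prometheus-k8s service account (both names are assumptions):

```yaml
# In each monitored namespace (e.g. opf-jupyterhub), grant the
# Prometheus SA from opf-monitoring permission to discover targets.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-target-discovery
  namespace: opf-jupyterhub
rules:
  - apiGroups: [""]
    resources: [services, endpoints, pods]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-target-discovery
  namespace: opf-jupyterhub
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s          # assumed SA of the Prometheus in opf-monitoring
    namespace: opf-monitoring
roleRef:
  kind: Role
  apiGroup: rbac.authorization.k8s.io
  name: prometheus-target-discovery
```

The same Role/RoleBinding pair would be repeated per monitored namespace (opf-argo, etc.).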
Please link all secrets that will need to be added as part of the kustomize builds in this repository. Once we have our secrets-management infrastructure in place, we'll use this thread as an aggregate point to update all kustomizations that need to be updated.
Kube state metrics provide the individual pod resource usage metrics that we would like to have for our dashboards.
Add Prometheus alertmanager to our monitoring stack
Related PRs:
Related upstream PRs:
Related internal DH PRs:
Related: #11
The instructions to create new Kafka topics need to be updated after the recent directory restructure.
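For context, a Strimzi KafkaTopic is declared roughly like this (topic and cluster names are placeholders; the strimzi.io/cluster label must match the deployed Kafka cluster):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: example-topic                     # hypothetical topic name
  labels:
    strimzi.io/cluster: odh-message-bus   # must match the Kafka cluster CR name
spec:
  partitions: 3
  replicas: 1
```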
Superset is running into the following error on MOC:
running "VolumeBinding" filter plugin for pod "superset-1-f9zll": pod has unbound immediate PersistentVolumeClaims
The PVC shows the following error:
superset-data:
no persistent volumes available for this claim and no storage class is set
I'm guessing this is related to #22, where JH had similar issues. But the superset component doesn't seem to offer the same method of changing the storage class via a parameter. We should try overriding this for now; if it works, suggest upstream add a parameter to specify the storage class.
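If the component exposes no parameter, a kustomize patch in the MOC overlay could force the storage class; a sketch, assuming the PVC is named superset-data as in the error above (the storage class name is a placeholder):

```yaml
# kustomization.yaml in a moc overlay (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: PersistentVolumeClaim
      name: superset-data
    patch: |-
      - op: add
        path: /spec/storageClassName
        value: ocs-storagecluster-ceph-rbd   # hypothetical MOC storage class
```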
We are already starting to see some MOC-specific implementation (e.g. storage class for JH #22 and possibly superset #35). We need a way to separate these out from general deployments.
We could separate them into bases/overlays (where overlays have folders like moc and quicklab). The problem is that our argocd-apps repo will then become environment specific (since we specify a path). As a solution we could also extend the base/overlays structure to the argocd-apps repo. That's one suggestion; open to other ideas.
Once ODH merges Data Catalog, switch OPF manifests repo pointers.
Related to: #32
I accidentally pushed a commit to master (shameful commit) and had to revert it. Can we disable commits to master at the org level?
This is a master task for the categorical encoding JH images deployment within the Operate First CD. The aim is to have a fully transparent, open source, deterministic deployment of a selected use case.
Steps:
All components should go in a single namespace, based on the examples for the kfdef files provided from upstream. This will require restructuring of the operate-first/odh repo.
We encountered an issue where Prometheus could not see Alertmanager on one of the Quicklab clusters. It seems this is because the default SA that Prometheus was assigned did not have sufficient permissions to see Alertmanager. Assigning Prometheus a serviceAccountName in the .spec of the Prometheus CR, where the SA had project admin permissions, solved this issue.
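The fix sketched as a Prometheus CR fragment (SA and Alertmanager names are illustrative, not the actual Quicklab values):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: odh-monitoring
spec:
  # SA bound to a role with enough permissions to see Alertmanager
  serviceAccountName: prometheus-k8s      # hypothetical SA name
  alerting:
    alertmanagers:
      - namespace: opf-monitoring         # hypothetical namespace
        name: alertmanager-operated       # assumed Alertmanager service
        port: web
```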
We should remove the namespaces from base and assign these in the respective overlays, see #98 for more discussion.
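With namespaces stripped from base, each overlay would assign its own, e.g. (overlay path and namespace are illustrative):

```yaml
# overlays/moc/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: opf-monitoring   # assigned here rather than in base
resources:
  - ../../base
```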
Related Issues:
Recently we've been coming across situations where we'd prefer to overlay instead of override resources. The difference being, when we overlay, we are having argocd deploy them, whereas with overriding in ODH, we have ODH deploy them.
Context:
So in some cases ODH deploys operators, which include CRDs. Take KafkaTopics for example. Say I have ArgoCD deploy the KafkaTopics and a KfDef that deploys Strimzi. If ArgoCD tries to deploy the KafkaTopics first, it can't, because it needs to deploy the KfDef first, so the application will fail.
We can use ArgoCD waves to tell ArgoCD to deploy our KfDef (and thus Strimzi) first, then deploy the KafkaTopics second. This too will fail, because the ODH operator needs to read the KfDef and deploy the CRDs itself. So we need ODH to deploy the KafkaTopics via overrides if we want to prevent errors from showing up.
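For reference, sync waves are plain annotations; a minimal sketch (resource names are hypothetical). Note that, as described above, even this ordering fails here, because the KafkaTopic CRD only exists after the ODH operator has processed the KfDef:

```yaml
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: opf-odh                           # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "0"     # deployed first
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: example-topic                     # hypothetical topic
  annotations:
    argocd.argoproj.io/sync-wave: "1"     # deployed after the KfDef
```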
Problem:
I personally don't know if ODH is at a point where we can confidently override certain resources and expect to have them be sync'ed immediately. I have personally encountered situations where ODH will take a while to update the resources, sometimes having to restart ODH itself. I would be interested to hear what your own experiences have been.
Another situation is that, by being forced into overrides, we are forced into following the folder structure of the odh-manifests repo. For example, if I want to override Grafana dashboards, I need to update this file here. But what if I want to add another dashboard? The way I understand it, I would need to override the kustomization here. Okay, but then I have a kustomization that looks like it's pulling some files from this repo and some files from another repo (and the other repo's existence isn't even clear right away by looking at this directory). It just seems messy to me.
Yet another issue is that we cannot see kfdef child resources any more, see here for more details. So we have no sync/visual control over these resources via argocd.
Related: #11
Some of the SRE monitoring dashboards, such as the SLI/SLO dashboard for JH, require additional kube state metrics, which include useful metrics such as individual pod resource usage.
When deploying the monitoring setup in Quicklab (via ArgoCD), we ran into the following Application Sync Error:
The monitoring manifests should be updated to be deployed in waves: https://argoproj.github.io/argo-cd/user-guide/sync-waves/
so that the kfdef is deployed first, followed by the Grafana/Prometheus custom resources.
Add a proxy to the prometheus instance so that only authenticated users have access to it.
This can be done by adding a proxy container to the Prometheus config:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: odh-monitoring
  labels:
    app: prometheus
    prometheus: k8s
spec:
  serviceMonitorSelector: {}
  securityContext: {}
  ruleSelector: {}
  replicas: 2
  containers:
    - name: oauth-proxy
      image: openshift/oauth-proxy:latest
      imagePullPolicy: IfNotPresent
      ports:
        - containerPort: 8443
          name: public
      args:
        - --https-address=:8443
        - --provider=openshift
        - --openshift-service-account=proxy
        - --upstream=http://localhost:9090
        - --tls-cert=/etc/tls/private/tls.crt
        - --tls-key=/etc/tls/private/tls.key
        - --cookie-secret=SECRET
```
Add service monitor for loki
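A ServiceMonitor for Loki might look like this (namespace, labels, and port name are assumptions that have to match the actual Loki service and the Prometheus serviceMonitorSelector):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki
  namespace: opf-monitoring     # hypothetical namespace
  labels:
    team: opf                   # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: loki                 # assumed label on the Loki service
  endpoints:
    - port: http-metrics        # assumed metrics port name
      path: /metrics
```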
Add new Prometheus deployment for opf-observatorium namespace
Based on our discussion in operate-first/support#1086, we should add documentation for adding new grafana dashboards
(also related to operate-first/sre#5)
Add KfDefs and deploy all components added via kfdef #9 on MOC using same method as the one employed here
Related: #11
This should probably be split into separate issues for the individual data catalog components
The repo has gone through an overhaul, and a lot of the paths/structure are currently out of date in the README.
As per the discussions in this issue thread here, we should document/summarize the important bits of the discussion. Mainly we want to document the override process and include instructions on how one can override changes. We should also link to the examples presented and Vasek's blog post for further reading.
ODH Dashboard requires all sub-components to be deployed in order to show the component on the Dashboard. We're violating that in our Grafana deployment, since we deploy only the grafana-cluster while the grafana-operator is not deployed via kfdef; we use our own manifests for it instead. However, that causes trouble for the ODH Dashboard.
Solutions:
grafana-operator component to make ODH Dashboard happy.
I would prefer solution 1. WDYT?
Related: #11
The default access should be read only for users.
Continuing discussions from here #3
There is a bit of disagreement on which method to follow, please air your concerns here once more so we can continue the discussion.
Related: #11
Upstream ODH already has Prometheus monitors in place for Argo, but they haven't been updated in a long time. It's very possible that they no longer function correctly / don't correctly grab metrics.
Once the ODH dashboard can handle multi-namespace deployments, switch to the upstream images.
See: #42