blueprint's People

Contributors

durandom, goern, harshad16, hemajv, humairak, ipolonsk, margarethaley, michaelclifford, schwesig, tumido

blueprint's Issues

Visibility on Collaboration

Hi Folks,

Humair and I were talking earlier today about improving visibility into the collaboration between the IDH, Operate First, and ODH teams. I think the engineers on these projects can generally agree that we have a good amount of shared development effort and knowledge sharing, but it is hard to state that quantifiably. Does anyone here have ideas on how to improve visibility on this?

It'd be cool if there were an existing tool that lets you see interactions between organizations on GitHub, but a cursory Google search didn't surface anything.

cc @HumairAK @tumido @4n4nd @durandom @vpavlin @accorvin

ADR MOC SSO -> Keycloak

Plan the transition from MOC SSO and dex (as a lightweight OpenID Connect bridge) to Keycloak, doing IDM ourselves.

We would like to phase out MOC SSO as the only identity provider and replace it with Keycloak backed by Google/GitHub/other identity aggregators and connectors. This way we can reach a broader audience. MOC SSO can remain as an option; we would connect to it via Keycloak.
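For reference, if the clusters run OpenShift, pointing them at a Keycloak realm is a matter of an OpenID Connect identity provider in the cluster OAuth config. A minimal sketch; the realm URL, client ID, and secret name are illustrative, not the actual Operate First setup:

```yaml
# Illustrative only: OpenShift OAuth cluster config using a Keycloak realm
# as an OpenID Connect identity provider.
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
    - name: keycloak
      mappingMethod: claim
      type: OpenID
      openID:
        clientID: openshift                   # client registered in the realm
        clientSecret:
          name: keycloak-client-secret        # Secret in openshift-config
        issuer: https://keycloak.example.com/auth/realms/operate-first
        claims:
          preferredUsername: [preferred_username]
          email: [email]
```

The Google/GitHub aggregation would then live inside Keycloak as identity-provider federation, keeping the cluster config stable.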

Discuss broader end goals for CI and e2e testing for the operate-first initiative.

We had discussions during sprint planning about having a more defined broader goal (or set of goals) when it comes to e2e testing. The purpose is to give us a way to gauge the more granular steps/issues/changes we will be applying and to ensure they are in line with our broader goals.

I think we can consider this issue closed once we can answer questions such as:

What does the ideal e2e scenario look like for operate-first? (e.g. stage/prod clusters, multi step deployments, schema validations, etc.)

Please add your thoughts and other questions that we should be looking to answer.

ADR for data license

We want to publish all collected data under an open source license.
Decide which one to use.

related issue #76

ADR for EULA

We need a EULA for accessing the environment

We don't want consumers of the operational data to have to sign a EULA just to access the data; instead, the data should be published under an open source license.

related PRs #75 #74

Taking suggestions on GitHub teams / team permissions for the operate-first organization

We never properly discussed which GitHub teams we should have, who should be in each team, and which team needs what permissions. Currently we have an Ops team that I just threw everyone into at the beginning, giving them all admin on every repo. That was just to get us going, but now I think we could use a bit more structure (especially since using GitHub for app auth is still a possibility and not off the table). This is the current setup:

(screenshot of the current team setup)

Let's clean this up and re-organize. So we need:

  1. list of teams, who should be in each team
  2. team permissions per repo

RFE: ApplicationSets and Directory Structure

The "ArgoCD App of Apps" section needs to be rethought, since ApplicationSets aim to solve this "bootstrapping" issue. Working with ApplicationSets will also give a new outlook on directory structure.

Will be happy to work with anyone to get this together.
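To make the proposal concrete, a minimal ApplicationSet with a git directory generator could replace a hand-maintained app-of-apps. A sketch; the repo URL and directory layout are illustrative:

```yaml
# Illustrative sketch: one Application per directory under argocd-apps/,
# generated instead of hand-written.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-apps
spec:
  generators:
    - git:
        repoURL: https://github.com/operate-first/apps.git
        revision: HEAD
        directories:
          - path: 'argocd-apps/*'       # assumed layout, one dir per app
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/operate-first/apps.git
        targetRevision: HEAD
        path: '{{path.path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
```

Adding an app then becomes "add a directory", which is the outlook-on-directory-structure point above.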

Create userstory: I would like to configure Ceph in Hue

As a Data Catalog user, I would like to know how to configure Ceph with Hue. This would help with our visualization needs where we need to pull data from our Ceph buckets and create tables for it in Hue (which are later connected in Superset to create dashboards).
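Untested sketch of where such configuration would likely live: Hue's S3 support is configured in hue.ini, and a Ceph RGW endpoint can be substituted for AWS. All values below are illustrative placeholders:

```ini
; hue.ini sketch (untested, values illustrative): point Hue's S3 browser
; at a Ceph RADOS Gateway endpoint instead of AWS.
[aws]
  [[aws_accounts]]
    [[[default]]]
      access_key_id=<ceph-access-key>
      secret_access_key=<ceph-secret-key>
      host=ceph-rgw.example.com
      is_secure=true
```

A userstory answer should also cover how the resulting tables flow on to Superset.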

Identifying and improving user expectations

@tumido, @4n4nd , and I had a brief discussion about what we currently think users/folks generally know when looking to bring their projects to operate-first environments, and what we would like them to know more about right off the bat.

Currently we believe most users know the following when first looking to gain access to operate-first environment/clusters:

  • Some general Openshift/k8s knowledge
  • Operate-first has 1 or more Openshift clusters where people can deploy their projects to
  • Possibly a link to the website & support repo

Some folks might know more, or even less, but based on the types of issues/questions we encounter, we think that's roughly what they know on average. If we use this as a baseline, we can improve how quickly we ramp up newcomers on the key pieces of information that help them look in the right places and make the right decisions. The first step is to identify what these key pieces of information are; we believe some of them to be:

  1. Links to onboarding templates
  2. How to get a namespace on one of our managed clusters, with project-admin access for individuals and teams
  3. We manage various services for the end-user (and have an exhaustive list of these services e.g. Jupyterhub, Kafka, etc.)
  4. End-user should use our Argocd (we should also promote Argocd more)
  5. We don’t allow cluster resources to be hosted in a repo outside this org. They must deploy cluster resources via a PR to the apps/cluster-scope location.
  6. They will need to work with and understand the basics of kustomize to work with our repos
  7. Toolbox is a thing to make it easier for newcomers to gain the required tooling

The goal of this issue is to identify more key pieces of information in this thread. Then we can have a follow-up thread on how to better organize our communication/docs to market this information. Identifying the first pieces of information we would like people to see can also help us improve the website.
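On point 6 above, the kustomize basics a newcomer needs are small: a kustomization.yaml that points at a shared base and patches it per environment. A minimal sketch, with an illustrative layout rather than the actual repo structure:

```yaml
# overlays/dev/kustomization.yaml — illustrative layout, not the real repo
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-project          # hypothetical namespace
resources:
  - ../../base                 # the shared manifests
patches:
  - path: limits-patch.yaml    # per-environment overrides
```

Linking a walkthrough like this from the onboarding docs would cover most of the kustomize knowledge requirement.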

ADR for definition and collection of workload data

copied from a mail by @tumido

We're working with various direct and indirect users:

  1. Academic users doing research on the data
  2. Engineers and operators working directly with the platform, deploying their workloads and consuming the data, their applications have their own users, described later as indirect users
  3. Workshop users who are invited to the platform usually by engineers to "try something out", deploying applications or what not
  4. Indirect consumers are users accessing applications deployed on top of the platform or are just presented with the data, not being aware that it originated from this platform.
  • This also includes users consuming our CI, since they can view logs hosted on our platform. These logs are results of CI job runs against their own code, though technically the logs are produced within our platform and hosted there. If anything goes wrong in the CI run, they are usually presented with snippets of what went wrong. Does this also require DUA?
  • Accessing hosted applications that allow users to browse or work on data within and/or outside of the platform.
  • Interfacing with APIs that are indirectly linked (cnamed) to our platforms providing recommendations etc (Project Thoth)
  • Interacting with our chat-ops in Github and Slack which is still an interaction with our platform

We have to define how users can opt in or out of the collection of their data.
It's similar to the operational data defined in #80.

Change default branches to main

Hey!

More and more open source projects are changing their default branch from master to main. New repositories on GitHub already default to main, and GitLab will soon do the same. Should we consider changing the default on operate-first repositories as well?

[yamllint] Document start settings

The question is, should we enforce --- document starts on YAML files or not?

This appeared here: operate-first/apps#628 (comment)
Attempt to solve this: operate-first/apps#639

I think this deserves a proper issue, rather than a few chat messages.

For enforcing:

  • Part of the standard
  • Less opinionated yamllint config (using default values)
  • We can cat multiple files if needed and docs are properly separated
  • Single YAML doc files are consistent with multidoc files

Against enforcing:

  • It's just clutter in most cases
  • Too verbose for single doc YAML files

Note: This change doesn't affect only the apps repo. It should be enforced equally on all repos across this organization.
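Whichever way the vote goes, the outcome is one rule in the shared yamllint config. A sketch of the "for enforcing" variant (flip `present` to `false` for the other option):

```yaml
# .yamllint — sketch: require the leading "---" on every YAML file
extends: default
rules:
  document-start:
    present: true
```

Keeping this in a single org-wide config file is what makes the "enforced equally on all repos" note practical.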

Concern - ArgoCD application name conflicts

Aggregating many applications across multiple repositories and deploying to multiple clusters increases the risk of two applications being given the same name. At the same time, the platform itself limits us to unique application names, because all the Application resources land in the same namespace.

The situation is even more unfortunate when different app-of-apps try syncing different application specs with the same name: ArgoCD would end up with two competing apps syncing "the same" resource.

The likelihood of this happening grows with the number of clusters and teams onboarded.

Proposed solution

Use namePrefix or nameSuffix in kustomization.yaml for different sections of argocd-apps

Similar to this PR: operate-first/argocd-apps#101

  • This works only for well-behaved apps, since the prefix is part of the manifests
  • It doesn't solve conflicts between app-of-apps: two different app-of-apps can still apply a resource with the same name

Use Application resource parameters

  • Works exceptionally well for app-of-apps, since it operates at the app-of-apps resource spec level: it makes all applications deployed via an app-of-apps follow the naming scheme
  • Independent from the manifests deployed by the app-of-apps

https://argoproj.github.io/argo-cd/user-guide/kustomize/#kustomize
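A sketch of the second option, with illustrative names: the app-of-apps sets a kustomize namePrefix on its source, so every child Application it renders gets a team-scoped name regardless of what the manifests say.

```yaml
# Illustrative only — names and repo path are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-a-apps
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/operate-first/argocd-apps.git
    path: team-a
    targetRevision: HEAD
    kustomize:
      namePrefix: team-a-     # prepended to every rendered resource name
```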

The result of the discussion on this issue will be submitted as an ADR.

Personas & User stories

As per our meeting discussions, I think it would be helpful if we considered creating user personas and user stories to gain better insight into the types of users we intend to accommodate. We already have a general understanding of our goals and missions, but this could help organize and categorize those needs onto several archetypes and help drive our decision making.

Given that our initiative is very contingent upon users using our tooling/docs/repos, this sort of user centered approach could be very beneficial for us.

I can identify at least 2 areas where we could potentially benefit from having user personas/stories (you might have more to add, so definitely open to suggestions):

  • Users of MOC
  • Users looking to emulate our setup

For both of these, I think we can have multiple personas. I was thinking a brief user story for each persona. The user story should be brief, 1-3 lines, and should document practical needs/motivations of the users. Ideally it should also help identify a roadmap for us.

I would imagine we don't want to go overboard here and create 50 different personas, which would not be helpful. So maybe we should try to limit them to about 5-10.

So having said all that, I would like to ask you guys:

  • Do you think user personas / stories will be beneficial here? Would you find it useful?

If so then:

  • What would you like to see in such personas/stories?
  • Do you have a specific type of user/goal/need you want captured in such personas/stories?
  • Anything else you would like to add?

The point of this discussion is to explore this type of user-centered design, gauge everyone's thoughts, and if we're all on board, create new issues to write such personas/userstories.

Revisit ADR 0006

Revisit monitoring ADR - define that:

  1. We use 2 parallel monitoring solutions
  2. We fully embrace user workload monitoring
  3. We deploy a separate thanos/observatorium solution for comparison
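Point 2 above amounts to a single flag on OpenShift; a sketch of the config map that turns user workload monitoring on:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```

The revised ADR would then mostly need to describe point 3, the parallel Thanos/Observatorium deployment, and how the two solutions are compared.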

Collaborate on Jupyterhub Notebook PVC Resizing

A common failure scenario for JupyterHub users is notebook server PVCs running out of storage capacity. Ideally we can update the ODH operator to proactively increase PVC sizes when it sees that a user is close to running out of space. Since @HumairAK and @4n4nd did a POC on this some time ago it'd be great if we could leverage your experience!
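As a reference for the mechanics (not the ODH operator's actual implementation): expanding a PVC is just a spec edit, provided the StorageClass allows it. An illustrative sketch:

```yaml
# Prerequisite on the StorageClass (names illustrative):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-sc
provisioner: example.com/provisioner   # placeholder
allowVolumeExpansion: true
---
# The operator (or a human) then only needs to raise the request:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyterhub-nb-user-pvc         # hypothetical name
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: example-sc
  resources:
    requests:
      storage: 20Gi                    # bumped from e.g. 10Gi
```

The interesting part of the collaboration is the trigger logic: watching volume usage metrics and deciding when to bump the request.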

Definition of the most open environment

We'll have environments that are connected to universities, as they're hosted in a university data center. Here the university imposes legal requirements.

But we'll also have environments that are located in EMEA on-premise datacenters, such as the Hetzner setup.
How can we create an environment that is as open as possible?

One of the key components of a true public Operate First capability for open source communities is going to be the legal framework that governs the data.

  1. grant public read-only access to all platform- and (potentially) workload configuration (excl. secrets) - think browsing open-source code
  2. grant public read-only access to all platform- and (potentially) workload operational data (logs, metrics, issues) without the need to scrub the data - think browsing open-source code
  3. grant authenticated read-write access to the environment with the least legal hurdles - e.g. by accessing you're accepting a EULA implicitly - think deploying workloads

We don't want to constrain how the data is accessed. At most, people with access might implicitly accept an agreement; at best, we want to make the data available to the world with no authentication in place, so that even a web search crawler could access it.
The technical implementation is straightforward, but we have to stay within legal requirements, like GDPR.

The legal question would be: what agreements/licenses/etc. do we need to put in place, and what is the nature of the data we're allowed to publish?

The definition of operational data is tracked in #80

Separate Features into "Persona Directories"

We can start with "Operations" and "Data scientist" personas directories. The purpose of these directories will be to store the feature files that belong to each persona accordingly.

CRC vs Quicklab deployments

I'm seeing little to no difference between crc/quicklab environments, and at this point I wonder if it's worth constantly adding a new directory for both when most of the time they end up being the same (referencing base). Would it not be better to just have a single "dev" folder instead? This would reduce a lot of unnecessary clutter.

If there is a significant difference between the two (in terms of how/what we deploy), feel free to mention it here -- this is simply what I've noticed πŸ˜ƒ

Re-keying secrets if a private key gets lost

Let's discuss how we plan to handle losing the Operate First key, or any key used to encrypt OPF resources.

  • We need a plan for what to do if the master key gets exposed.
  • What should we do if a sub-key (like a team key encrypting only a subset of the resources) is exposed?
  • How do we handle git history, with all the forks and everything, if we need to rotate keys? Files encrypted with the old key are still retained in the history.
  • How do we mass-add new keys and re-encrypt all the current resources with a new key?
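Assuming the secrets are encrypted with sops (illustrative; the actual tooling may differ), the mass-rotation question maps to editing the repo-level key policy and re-encrypting. A sketch with placeholder fingerprints:

```yaml
# .sops.yaml sketch — fingerprints are placeholders. After swapping the
# compromised key for a new one here, running `sops updatekeys <file>` on
# each encrypted file re-encrypts it without the old key.
creation_rules:
  - path_regex: .*\.enc\.yaml$
    pgp: >-
      NEW_MASTER_KEY_FPR,TEAM_SUBKEY_FPR
```

This handles current files; the git-history question above remains, since old ciphertexts stay recoverable by anyone holding the leaked key.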

Write ADR for Log Forwarding pipeline

The work currently being done on adding long term storage to logs, and forwarding logs from CLO to loki (via CLF, Kafka, and vector) is something we should catalogue as a decision record.
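To anchor the ADR, a sketch of the pipeline's first hop, assuming the Cluster Log Forwarder API; the broker address and topic are illustrative:

```yaml
# Illustrative sketch: CLO's ClusterLogForwarder shipping application logs
# to a Kafka topic (from which vector/loki pick them up downstream).
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: kafka-app-logs
      type: kafka
      url: tls://kafka.example.svc:9093/app-logs   # broker + topic, placeholder
  pipelines:
    - name: forward-app-logs
      inputRefs: [application]
      outputRefs: [kafka-app-logs]
```

The ADR should record why Kafka sits between CLF and Loki rather than forwarding directly.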

ADR for deploying operators

Define what goes into cluster-scope and what does not. Include different viewpoints:

  • Permissions to deploy certain resource types (Subscription and OperatorGroup)
  • OperatorGroup namespace scope - cluster-wide vs. single namespace
  • meta-operator CR permissions
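The resource types in question, for concreteness. A sketch of a namespaced operator install; all names are illustrative, and omitting `targetNamespaces` is what makes an OperatorGroup cluster-wide:

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: example-og
  namespace: example-ns
spec:
  targetNamespaces:
    - example-ns        # omit this list entirely for cluster-wide scope
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: example-ns
spec:
  channel: stable
  name: example-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
```

The ADR then decides who may merge each of these kinds, and into which directories.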

ADR - Resource Quotas Enforcement

With the introduction of tiered quotas in the apps repo: operate-first/apps#439
We are ready to begin enforcing quotas on namespaces. We should first have an ADR describing the strategy we would like to employ and the details of how it will work.
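For reference, a tier in this scheme would presumably boil down to a ResourceQuota like the following; the tier name and numbers are illustrative, not the values from the linked PR:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: small-tier          # hypothetical tier name
  namespace: example-project
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
```

The ADR should cover how namespaces are assigned a tier and how upgrades between tiers are requested.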

ADR for Data Collection

The Operate First environments will create a vast amount of operational data from platform systems and user workloads.
We want to publish the data under a license agreement that is similar to an open source license agreement.
We still have to operate in the boundaries of law and therefore cannot publish data that would break the law.

We need an ADR for different options on how to satisfy this requirement.

CPU pressure - Shared resources guide

Create a guide on how to be respectful and mindful of shared resources and on what CPU request and CPU limit mean. We're in a constant state of CPU pressure, while utilization is never above 20%.

  • Explain to users what the CPU request and the CPU limit mean.
  • Explain to users that setting a CPU request above 1 makes no difference if their application is not multi-core.
  • Create a dashboard where they can check their Kubeflow pipeline/JupyterHub pod's actual usage compared to its CPU request
  • Create an alert for when the CPU request is far above the actual usage
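The request/limit distinction the guide needs to explain fits in one snippet (values illustrative):

```yaml
# Container resources sketch — values are examples, not recommendations.
resources:
  requests:
    cpu: 500m    # what the scheduler reserves; counts against cluster
                 # capacity even if the pod never actually uses it
  limits:
    cpu: "1"     # hard ceiling; the kernel throttles the container above this
```

Over-sized requests are exactly why the cluster reports CPU pressure at 20% utilization: reserved-but-idle CPU is unavailable to everyone else.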

ADR required for monitoring setup

We have made some progress with setting up monitoring, and have some issues already created in the appropriate repos for creating service monitors, prometheus / grafana deployments etc. What's lacking is a document that adds context to our setup and future plans. For this we should prepare an ADR that outlines our monitoring architecture.

Create a component overview of 'the Op1st company'

As an outside spectator,
I want to have a look at the website of operate first
so that I can learn which components are deployed and maintained by op1st

As an example, we could have a look at the Fedora Infrastructure wiki
