operate-first / blueprint
This is the blueprint for the Operate First Initiative
License: GNU General Public License v3.0
@tumido, @4n4nd, and I had a brief discussion about what we currently think users generally know when looking to bring their projects to Operate First environments, and what we would like them to know right off the bat.
Currently we believe most users know the following when first looking to gain access to operate-first environment/clusters:
Some folks might know more, or even less, but we think that's roughly the average based on the types of issues/questions we encounter. If we use this as a baseline, we can improve how quickly we ramp up newcomers on the key pieces of information that help them look in the right places and make the right decisions. The first step is to identify what these key pieces of information are; we believe some of them to be:
The goal of this issue is to identify more key pieces of information in this thread. Then we can have a follow-up thread on how to better organize our communication/docs to market this information. Identifying the first pieces of information we would like people to see can also help us improve the website.
We never properly discussed which GitHub teams we should have, who should be in each team, and which team needs what permissions. Currently we have an Ops team into which I threw everyone at the beginning, giving them all admin on every repo. That was just to get us going, but now I think we could use a bit more structure (especially since using GitHub for auth in our apps is still a possibility and not off the table). This is the current setup:
Let's clean this up and re-organize. So we need:
Hi Folks,
Humair and I were talking earlier today about highlighting visibility into the collaboration between the IDH, Operate First, and ODH teams. I think the engineers on the projects can generally agree that we have a good amount of shared development effort and knowledge sharing, but it is hard to state that quantifiably. Does anyone here have ideas on improving visibility into this?
It'd be cool if there were some existing tool that lets you see interactions between organizations on GitHub, but a cursory Google search didn't seem to surface anything.
In operate-first/sre#19, we looked at configuring the GitHub alert manager for handling Prometheus alerts. We should create an ADR that outlines our alerting setup.
The Operate First environments will create a vast amount of operational data from platform systems and user workloads.
We want to publish the data under a license agreement that is similar to an open source license agreement.
We still have to operate in the boundaries of law and therefore cannot publish data that would break the law.
We need an ADR for different options on how to satisfy this requirement.
Hey!
More and more open source projects are changing their default branch to be `main` instead of `master`. New repositories on GitHub already default to `main`, and GitLab will soon do the same. Should we consider changing the defaults on operate-first repositories as well?
PR checks keep failing because the CI config files do not exist in this repo, and as a result the markdown files are not properly linted.
After the discovery of thoth-station/thoth-application#1215, it seems to me we need better guidance on how to use `approve`, `lgtm`, and `hold`, and what is expected to happen after each is applied. I'll sum it up in an ADR doc.
As per our meeting discussions, I think it would be helpful if we considered creating user personas and user stories to gain better insight into the types of users we intend to accommodate. We already have a general understanding of our goals and missions, but this could help organize and categorize those needs onto several archetypes and help drive our decision making.
Given that our initiative is very contingent upon users using our tooling/docs/repos, this sort of user centered approach could be very beneficial for us.
I can identify at least 2 areas where we could potentially benefit from having user personas/stories (you might have more to add, so I'm definitely open to suggestions):
For both of these, I think we can have multiple personas. I was thinking a brief user story for each persona. The user story should be brief, 1-3 lines, and should document practical needs/motivations of the users. Ideally it should also help identify a roadmap for us.
I imagine we don't want to go overboard here and create 50 different personas, which would not be helpful. So maybe we should limit them to about 5-10.
So having said all that, I would like to ask you all:
If so then:
The point of this discussion is to explore this type of user-centered design, gauge everyone's thoughts, and, if we're all on board, create new issues to write such personas/user stories.
Create a guide on how to be respectful and mindful of shared resources, and on what CPU request and CPU limit mean. We're in a constant state of CPU pressure, while utilization is never above 20%. Users should set a CPU limit of 1 if their application is not multi-core.

We would like features to be in one location, either in one file or grouped under one directory, to tell one full story per feature, as opposed to organizing by personas.
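A minimal sketch of what such guidance could show for a container spec (the values here are illustrative, not recommended defaults):

```yaml
# Container resources: the request is what the scheduler reserves for you,
# the limit is the hard cap your container can use.
resources:
  requests:
    cpu: 100m      # reserve 0.1 core; keep this close to actual usage
    memory: 256Mi
  limits:
    cpu: "1"       # cap at 1 core for single-threaded applications
    memory: 512Mi
```

Over-requesting CPU is what creates the pressure described above: the scheduler reserves capacity that is never actually used.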
The MOC, NERC and MGHPCC have existing requirements for granting different types of users access to operations data. These are not fully specified in the blueprint or ADRs for the current open issues related to access control. Collect this information and decide how to add it to the design document and relevant open issues.
Let's discuss how we plan to handle losing the Operate First key, or any key used to encrypt OPF resources.
I'm seeing little to no difference between crc/quicklab environments, and at this point I wonder if it's worth constantly adding a new directory for both when most of the time they end up being the same (referencing base). Would it not be better to just have a "dev" folder instead? This would reduce a lot of unnecessary clutter.
If there is a significant difference between the two (in terms of how/what we deploy), feel free to mention it here -- this is simply what I've noticed.
copied from a mail by @tumido
We're working with various direct and indirect users:
We have to define how users can opt in or opt out of the collection of their data.
It's similar to the operational data defined in #80.
The work currently being done on adding long term storage to logs, and forwarding logs from CLO to loki (via CLF, Kafka, and vector) is something we should catalogue as a decision record.
SSIA
We can start with "Operations" and "Data scientist" personas directories. The purpose of these directories will be to store the feature files that belong to each persona accordingly.
Plan the transition from MOC SSO and Dex (as a lightweight OpenID Connect bridge) to Keycloak, doing IDM ourselves.
We would like to phase out MOC SSO as the only identity provider and replace it with Keycloak plus Google/GitHub/other identity aggregators and connectors. This way we can reach a broader audience. MOC SSO can remain as an option -- we would connect to it via Keycloak.
The "ArgoCD App of Apps" section needs to be rethought, since ApplicationSets aim to solve this "bootstrapping" issue. Also, working with ApplicationSets will give you a new outlook on directory structure.
Will be happy to work with anyone to get this together.
We can start with the JupyterHub/Data scientist persona and Operations.
With the introduction of the kafka runbook, a lot of these steps can be converted to feature files, and linked from the runbook (as we did for Jupyterhub).
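As an illustrative sketch (the scenario and step names are hypothetical, not taken from the actual runbook), a Kafka runbook step converted to a feature file could look like:

```gherkin
Feature: Kafka broker recovery
  Scenario: A broker pod is restarted
    Given a Kafka cluster with 3 brokers
    When one broker pod is deleted
    Then the pod is recreated by the operator
    And all topics remain available
```

Each scenario can then be linked from the corresponding runbook section, as was done for JupyterHub.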
Define what goes into `cluster-scope` and what does not. Include different viewpoints:
- OLM resources (`Subscription` and `OperatorGroup`)
- `OperatorGroup` namespace scope: cluster-wide vs. single namespace
Related: here
Aggregating many applications across multiple repositories and deploying to multiple clusters increases the risk of two applications being named the same. On the other hand, the nature of the OCP platform itself limits us to unique application names only, because all the `Application` resources land in the same namespace.
The situation is even more unfortunate when different app-of-apps try syncing different application specs with the same name: ArgoCD ends up with 2 competing apps syncing "the same" resource.
The likelihood of this happening grows with the number of clusters and teams onboarded.
Option 1: `namePrefix` or `nameSuffix` in `kustomization.yaml` for different sections of argocd-apps
Similar to this PR: operate-first/argocd-apps#101
- Works only for well-behaving apps, since it's part of the manifests
- Doesn't work for conflicts between app-of-apps: 2 different app-of-apps can still apply a resource with the same name
Option 2: `Application` resource parameters
- Works exceptionally well for app-of-apps, since it operates at the app-of-apps resource spec level: it always makes all applications deployed via an app-of-apps follow the naming scheme
- Independent from the manifests deployed by the app-of-apps
https://argoproj.github.io/argo-cd/user-guide/kustomize/#kustomize
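A minimal sketch of option 1, assuming a kustomization that wraps one team's `Application` manifests (the prefix and paths are illustrative):

```yaml
# kustomization.yaml for one section of argocd-apps
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namePrefix: team-a-   # every Application in this section gets a unique per-team prefix
resources:
  - applications/my-app.yaml
```

With this, `my-app` is deployed as `team-a-my-app`, so two teams shipping an app of the same name no longer collide -- as long as both go through a prefixed kustomization.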
Result of a discussion on this issue will be submitted as an ADR.
Describe how we chose Rook as the Ceph/S3 provider.
Related to operate-first/support#48
Currently in some repos we're using a `hack` folder; is this something we want to continue doing, or should we change it?
We would like to ensure that all features/issues are labelled with useful labels that can later be fed into a visualization tool for further analysis, which will guide our future decisions on which features and user stories to tackle.
With recent work on:
We should frame the auth architecture used.
We want to publish all collected data under an open source license.
Decide which one to use.
related issue #76
We'll have environments that are connected to universities, as they're hosted in a university data center. Here the university superimposes legal requirements.
But we'll also have environments that are located in EMEA on-premise datacenters, such as the Hetzner setup.
How can we create an environment that is as open as possible?
One of the key components of a true public Operate First capability for open source communities is going to be the legal framework that governs the data.
We don't want to constrain how the data is accessed -- at most, people accessing the data might implicitly accept an agreement. Ideally we want to make it available to the world, with no authentication in place, so that even a web search crawler could access it.
The technical implementation is straightforward. But we have to stay within legal requirements, like GDPR.
The legal question is: what agreements/licenses/etc. do we need to put in place, and what is the nature of the data we're allowed to publish?
The definition of operational data is tracked in #80
As a Data Catalog user, I would like to know how to configure Ceph with Hue. This would help with our visualization needs where we need to pull data from our Ceph buckets and create tables for it in Hue (which are later connected in Superset to create dashboards).
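As a hedged sketch of where that configuration would live (an assumption on our side -- the exact keys should be verified against the Hue docs; the endpoint and credentials are placeholders), Hue's S3 browser can be pointed at a Ceph RGW endpoint in `hue.ini`:

```ini
# hue.ini -- point Hue's S3 integration at a Ceph RADOS Gateway (values illustrative)
[aws]
  [[aws_accounts]]
    [[[default]]]
      access_key_id=<ceph-access-key>
      secret_access_key=<ceph-secret-key>
      host=ceph-rgw.example.com
      is_secure=true
```

Once the buckets are visible in Hue, tables can be defined over them and then connected in Superset for dashboards.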
Revisit monitoring ADR - define that:
Related to: operate-first/support#172
Advise users on how to diagnose PVC storage usage and identify the usual causes of PVCs being full.
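A hedged sketch of what that guidance could include (mount paths are placeholders, here demonstrated on the current directory); the same commands work inside a pod via `kubectl -n <namespace> exec <pod> -- ...`:

```shell
# How full is the filesystem backing a mount path?
df -h .
# Which entries under it consume the most space?
du -sh ./* 2>/dev/null | sort -h | tail -n 5
```

Typical causes worth listing in the guide: accumulated logs, cached datasets, and checkpoints that are never cleaned up.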
With the introduction of tiered quotas in the apps repo: operate-first/apps#439
We are ready to begin enforcing quotas on namespaces. We should first have an ADR around the strategy we would like to employ and the details of how it will work.
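For concreteness, a tiered quota might be expressed as a `ResourceQuota` per namespace; this sketch is illustrative (the tier name and limits are assumptions, not the values from the linked PR):

```yaml
# Illustrative ResourceQuota for a "small" tier namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: small-tier
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    persistentvolumeclaims: "5"
```

The ADR should cover how namespaces are assigned to tiers and how users request a larger tier.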
As an outside spectator,
I want to have a look at the Operate First website
so that I can learn which components are deployed and maintained by op1st.
As an example, we could have a look at the Fedora Infrastructure wiki
We had discussions during sprint planning about having a more defined, broader goal (or set of goals) for e2e testing. The purpose is to gauge the more granular steps/issues/changes we will be applying and to ensure they are in line with our broader goals.
I think we can consider this issue closed once we can answer some of these questions:
What does the ideal e2e scenario look like for operate-first? (e.g. stage/prod clusters, multi step deployments, schema validations, etc.)
Please add your thoughts and other questions that we should be looking to answer.
SSIA
Let's figure out what labels we need for op1st and how they support our workflows. See https://github.com/thoth-station/thoth-application/blob/master/prow/overlays/cnv-prod/labels.yaml and open a PR against it.
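As a starting point, a sketch of what an entry could look like -- the format here is an assumption based on Prow's label_sync config (which the linked file follows), and the specific labels are illustrative:

```yaml
# labels.yaml sketch -- labels consumed by Prow's label_sync tool
default:
  labels:
    - name: kind/bug
      color: e11d21
      description: Something is not working
    - name: sig/docs
      color: 0ffa16
      description: Documentation-related work
```

Agreeing on `kind/*` and `sig/*` prefixes up front would also make the labels easy to feed into the visualization tooling discussed elsewhere.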
#srcOps
/sig docs
A common failure scenario for JupyterHub users is notebook server PVCs running out of storage capacity. Ideally we can update the ODH operator to proactively increase PVC sizes when it sees that a user is close to running out of space. Since @HumairAK and @4n4nd did a POC on this some time ago it'd be great if we could leverage your experience!
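The core of such a controller is a resize policy; here is a minimal sketch of one possible policy (this is illustrative, not the logic from the POC -- the threshold and growth factor are assumptions). The actual operator would read usage metrics and patch the PVC's requested storage:

```python
def next_pvc_size(current_gib: int, used_fraction: float,
                  threshold: float = 0.85, growth: float = 2.0) -> int:
    """Return the new PVC size in GiB.

    If the volume is more than `threshold` full, grow it by `growth`;
    otherwise leave it unchanged. Doubling (rather than small increments)
    keeps the number of resize operations low.
    """
    if used_fraction >= threshold:
        return int(current_gib * growth)
    return current_gib
```

For example, a 10 GiB PVC at 90% usage would be resized to 20 GiB, while one at 50% usage is left alone.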
Discuss ideas for grafana dashboard collaboration (opf, RHODS, upstream ODH). We need to have a single source of truth for the grafana dashboards being developed across our teams.
The question is: should we enforce `---` document-start markers on YAML files or not?
This appeared here: operate-first/apps#628 (comment)
Attempt to solve this: operate-first/apps#639
I think this deserves a proper issue, rather than a few chat messages.
For enforcing:
- You can `cat` multiple files if needed and the documents are properly separated.
Against enforcing:
Note: this change doesn't affect only the `apps` repo. It should be enforced equally across all repos in this organization.
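If enforcement goes through yamllint (an assumption on our side -- the linked PR may use a different tool), the relevant rule would be:

```yaml
# .yamllint -- require every YAML file to begin with the `---` document-start marker
extends: default
rules:
  document-start:
    present: true
```

Setting `present: false` instead would enforce the opposite convention, so the same mechanism works whichever way the decision goes.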
We have made some progress with setting up monitoring, and have some issues already created in the appropriate repos for creating service monitors, prometheus / grafana deployments etc. What's lacking is a document that adds context to our setup and future plans. For this we should prepare an ADR that outlines our monitoring architecture.