blueprint's People

Contributors

durandom, goern, harshad16, hemajv, humairak, ipolonsk, margarethaley, michaelclifford, schwesig, tumido

blueprint's Issues

Visibility on Collaboration

Hi Folks,

Humair and I were talking earlier today about improving visibility into the collaboration between the IDH, Operate First, and ODH teams. I think the engineers on these projects can generally agree that we have a good amount of shared development effort and knowledge sharing, but it is hard to state that quantifiably. Does anyone here have ideas on how to improve visibility on this?

It'd be cool if there were an existing tool that lets you see interactions between organizations on GitHub, but a cursory Google search didn't surface anything.

cc @HumairAK @tumido @4n4nd @durandom @vpavlin @accorvin

ADR MOC SSO -> Keycloak

Plan the transition from MOC SSO and dex (as a lightweight OpenID Connect bridge) to Keycloak, doing IDM ourselves.

We would like to phase out MOC SSO as the only identity provider and replace it with Keycloak backed by Google/GitHub/other identity aggregators and connectors. This way we can reach a broader audience. MOC SSO can remain as an option; we would connect to it via Keycloak.
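For reference, if the clusters run OpenShift, pointing them at a Keycloak realm is a matter of an OpenID Connect identity provider in the cluster OAuth config. A minimal sketch; the realm URL, client ID, and secret name are illustrative, not the actual Operate First setup:

```yaml
# Illustrative only: OpenShift OAuth cluster config using a Keycloak realm
# as an OpenID Connect identity provider.
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
    - name: keycloak
      mappingMethod: claim
      type: OpenID
      openID:
        clientID: openshift                   # client registered in the realm
        clientSecret:
          name: keycloak-client-secret        # Secret in openshift-config
        issuer: https://keycloak.example.com/auth/realms/operate-first
        claims:
          preferredUsername: [preferred_username]
          email: [email]
```

The Google/GitHub aggregation would then live inside Keycloak as identity-provider federation, keeping the cluster config stable.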

Discuss broader end goals for CI and e2e testing for the operate-first initiative.

We had discussions during sprint planning about having a more defined broader goal (or set of goals) when it comes to e2e testing. The purpose is to give us a way to gauge the more granular steps/issues/changes we will be applying and to ensure they are in line with our broader goals.

I think we can consider this issue closed once we can answer questions such as:

What does the ideal e2e scenario look like for operate-first? (e.g. stage/prod clusters, multi step deployments, schema validations, etc.)

Please add your thoughts and other questions that we should be looking to answer.

ADR for data license

We want to publish all collected data under an open source license.
Decide which one to use.

related issue #76

ADR for EULA

We need a EULA for accessing the environment

We don't want consumers of the operational data to have to sign a EULA just to access the data; instead, the data should be published under an open source license.

related PRs #75 #74

Taking suggestions on GitHub teams / team permissions for the operate-first organization

We never properly discussed which GitHub teams we should have, who should be in each team, and which team needs what permissions. Currently we have an Ops team that I just threw everyone into at the beginning, giving them all admin on every repo. That was just to get us going, but now I think we could use a bit more structure (especially since using GitHub for app auth is still a possibility and not off the table). This is the current setup:

(screenshot of the current team setup)

Let's clean this up and re-organize. So we need:

  1. list of teams, who should be in each team
  2. team permissions per repo

RFE: ApplicationSets and Directory Structure

The "ArgoCD App of Apps" section needs to be rethought, since ApplicationSets aim to solve this "bootstrapping" issue. Working with ApplicationSets will also give a new outlook on directory structure.

Will be happy to work with anyone to get this together.
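To make the proposal concrete, a minimal ApplicationSet with a git directory generator could replace a hand-maintained app-of-apps. A sketch; the repo URL and directory layout are illustrative:

```yaml
# Illustrative sketch: one Application per directory under argocd-apps/,
# generated instead of hand-written.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-apps
spec:
  generators:
    - git:
        repoURL: https://github.com/operate-first/apps.git
        revision: HEAD
        directories:
          - path: 'argocd-apps/*'       # assumed layout, one dir per app
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/operate-first/apps.git
        targetRevision: HEAD
        path: '{{path.path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
```

Adding an app then becomes "add a directory", which is the outlook-on-directory-structure point above.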

Create userstory: I would like to configure Ceph in Hue

As a Data Catalog user, I would like to know how to configure Ceph with Hue. This would help with our visualization needs where we need to pull data from our Ceph buckets and create tables for it in Hue (which are later connected in Superset to create dashboards).
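Untested sketch of where such configuration would likely live: Hue's S3 support is configured in hue.ini, and a Ceph RGW endpoint can be substituted for AWS. All values below are illustrative placeholders:

```ini
; hue.ini sketch (untested, values illustrative): point Hue's S3 browser
; at a Ceph RADOS Gateway endpoint instead of AWS.
[aws]
  [[aws_accounts]]
    [[[default]]]
      access_key_id=<ceph-access-key>
      secret_access_key=<ceph-secret-key>
      host=ceph-rgw.example.com
      is_secure=true
```

A userstory answer should also cover how the resulting tables flow on to Superset.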

Identifying and improving user expectations

@tumido, @4n4nd , and I had a brief discussion about what we currently think users/folks generally know when looking to bring their projects to operate-first environments, and what we would like them to know more about right off the bat.

Currently we believe most users know the following when first looking to gain access to operate-first environment/clusters:

  • Some general Openshift/k8s knowledge
  • Operate-first has 1 or more Openshift clusters where people can deploy their projects to
  • Possibly a link to the website & support repo

Some folks might know more, or even less, but based on the types of issues/questions we encounter, we think that's roughly what they know on average. If we use this as a baseline, we can improve how quickly we ramp up newcomers on the key pieces of information that help them look in the right places and make the right decisions. The first step is to identify what these key pieces of information are; we believe some of them to be:

  1. Links to onboarding templates
  2. How to get a namespace on one of our managed clusters, with project-admin access for individuals and teams
  3. We manage various services for the end-user (and have an exhaustive list of these services e.g. Jupyterhub, Kafka, etc.)
  4. End-user should use our Argocd (we should also promote Argocd more)
  5. We don’t allow cluster resources to be hosted in a repo outside this org. They must deploy cluster resources via a PR to the apps/cluster-scope location.
  6. They will need to work with and understand the basics of kustomize to work with our repos
  7. Toolbox is a thing to make it easier for newcomers to gain the required tooling

The goal of this issue is to identify more key pieces of information in this thread. Then we can have a follow-up thread on how to better organize our communication/docs to market this information. Identifying the first pieces of information we would like people to see can also help us improve the website.
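On point 6 above, the kustomize basics a newcomer needs are small: a kustomization.yaml that points at a shared base and patches it per environment. A minimal sketch, with an illustrative layout rather than the actual repo structure:

```yaml
# overlays/dev/kustomization.yaml — illustrative layout, not the real repo
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: my-project          # hypothetical namespace
resources:
  - ../../base                 # the shared manifests
patches:
  - path: limits-patch.yaml    # per-environment overrides
```

Linking a walkthrough like this from the onboarding docs would cover most of the kustomize knowledge requirement.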

ADR for definition and collection of workload data

copied from a mail by @tumido

We're working with various direct and indirect users:

  1. Academic users doing research on the data
  2. Engineers and operators working directly with the platform, deploying their workloads and consuming the data, their applications have their own users, described later as indirect users
  3. Workshop users who are invited to the platform usually by engineers to "try something out", deploying applications or what not
  4. Indirect consumers are users accessing applications deployed on top of the platform or are just presented with the data, not being aware that it originated from this platform.
  • This also includes users consuming our CI, since they can view logs hosted on our platform. These logs are results of CI job runs against their own code, though technically the logs are produced within our platform and hosted there. If anything goes wrong in the CI run, they are usually presented with snippets of what went wrong. Does this also require DUA?
  • Accessing hosted applications that allow users to browse or work on data within and/or outside of the platform.
  • Interfacing with APIs that are indirectly linked (cnamed) to our platforms providing recommendations etc (Project Thoth)
  • Interacting with our chat-ops in Github and Slack which is still an interaction with our platform

We have to define how users can opt in or out of the collection of their data.
It's similar to the operational data defined in #80.

Change default branches to main

Hey!

More and more open source projects are changing their default branch from master to main. New repositories on GitHub already default to main, and GitLab will soon do the same. Should we consider changing the default on operate-first repositories as well?

[yamllint] Document start settings

The question is, should we enforce --- document starts on YAML files or not?

This appeared here: operate-first/apps#628 (comment)
Attempt to solve this: operate-first/apps#639

I think this deserves a proper issue, rather than a few chat messages.

For enforcing:

  • Part of the standard
  • Less opinionated yamllint config (using default values)
  • We can cat multiple files if needed and docs are properly separated
  • Single YAML doc files are consistent with multidoc files

Against enforcing:

  • It's just clutter in most cases
  • Too verbose for single doc YAML files

Note: This change doesn't affect only the apps repo. It should be enforced equally on all repos across this organization.
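Whichever way the vote goes, the outcome is one rule in the shared yamllint config. A sketch of the "for enforcing" variant (flip `present` to `false` for the other option):

```yaml
# .yamllint — sketch: require the leading "---" on every YAML file
extends: default
rules:
  document-start:
    present: true
```

Keeping this in a single org-wide config file is what makes the "enforced equally on all repos" note practical.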

Concern - ArgoCD application name conflicts

Aggregating many applications across multiple repositories and deploying to multiple clusters increases the risk of two applications being given the same name. At the same time, the platform itself limits us to unique application names, because all the Application resources land in the same namespace.

The situation is even more unfortunate when different app-of-apps try syncing different application specs with the same name: ArgoCD would end up with two competing apps syncing "the same" resource.

The likelihood of this happening grows with the number of clusters and teams onboarded.

Proposed solution

Use namePrefix or nameSuffix in kustomization.yaml for different sections of argocd-apps

Similar to this PR: operate-first/argocd-apps#101

  • This works only for well-behaved apps, since the prefix is part of the manifests
  • It doesn't solve conflicts between app-of-apps: two different app-of-apps can still apply a resource with the same name

Use Application resource parameters

  • Works exceptionally well for app-of-apps, since it operates at the app-of-apps resource spec level: it makes all applications deployed via an app-of-apps follow the naming scheme
  • Independent from the manifests deployed by the app-of-apps

https://argoproj.github.io/argo-cd/user-guide/kustomize/#kustomize
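A sketch of the second option, with illustrative names: the app-of-apps sets a kustomize namePrefix on its source, so every child Application it renders gets a team-scoped name regardless of what the manifests say.

```yaml
# Illustrative only — names and repo path are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-a-apps
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/operate-first/argocd-apps.git
    path: team-a
    targetRevision: HEAD
    kustomize:
      namePrefix: team-a-     # prepended to every rendered resource name
```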

The result of the discussion on this issue will be submitted as an ADR.

Personas & User stories

As per our meeting discussions, I think it would be helpful if we considered creating user personas and user stories to gain better insight into the types of users we intend to accommodate. We already have a general understanding of our goals and missions, but this could help organize and categorize those needs onto several archetypes and help drive our decision making.

Given that our initiative is very contingent upon users using our tooling/docs/repos, this sort of user centered approach could be very beneficial for us.

I can identify at least 2 areas where we could potentially benefit from having user personas/stories (you might have more to add, so definitely open to suggestions):

  • Users of MOC
  • Users looking to emulate our setup

For both of these, I think we can have multiple personas. I was thinking a brief user story for each persona. The user story should be brief, 1-3 lines, and should document practical needs/motivations of the users. Ideally it should also help identify a roadmap for us.

I would imagine we don't want to go overboard here and create 50 different personas, which would not be helpful. So maybe we should try to limit them to about 5-10.

So having said all that, I would like to ask you guys:

  • Do you think user personas / stories will be beneficial here? Would you find it useful?

If so then:

  • What would you like to see in such personas/stories?
  • Do you have a specific type of user/goal/need you want captured in such personas/stories?
  • Anything else you would like to add?

The point of this discussion is to explore this type of user-centered design, gauge everyone's thoughts, and if we're all on board, create new issues to write such personas/userstories.

Revisit ADR 0006

Revisit monitoring ADR - define that:

  1. We use 2 parallel monitoring solutions
  2. We fully embrace user workload monitoring
  3. We deploy a separate thanos/observatorium solution for comparison
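Point 2 above amounts to a single flag on OpenShift; a sketch of the config map that turns user workload monitoring on:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```

The revised ADR would then mostly need to describe point 3, the parallel Thanos/Observatorium deployment, and how the two solutions are compared.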

Collaborate on Jupyterhub Notebook PVC Resizing

A common failure scenario for JupyterHub users is notebook server PVCs running out of storage capacity. Ideally we can update the ODH operator to proactively increase PVC sizes when it sees that a user is close to running out of space. Since @HumairAK and @4n4nd did a POC on this some time ago it'd be great if we could leverage your experience!
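As a reference for the mechanics (not the ODH operator's actual implementation): expanding a PVC is just a spec edit, provided the StorageClass allows it. An illustrative sketch:

```yaml
# Prerequisite on the StorageClass (names illustrative):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-sc
provisioner: example.com/provisioner   # placeholder
allowVolumeExpansion: true
---
# The operator (or a human) then only needs to raise the request:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyterhub-nb-user-pvc         # hypothetical name
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: example-sc
  resources:
    requests:
      storage: 20Gi                    # bumped from e.g. 10Gi
```

The interesting part of the collaboration is the trigger logic: watching volume usage metrics and deciding when to bump the request.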

Definition of the most open environment

We'll have environments that are connected to universities, as they're hosted in a university data center. Here the university imposes legal requirements.

But we'll also have environments that are located in EMEA on-premise datacenters, such as the Hetzner setup.
How can we create an environment that is as open as possible?

One of the key components of a true public Operate First capability for open source communities is going to be the legal framework that governs the data.

  1. grant public read-only access to all platform- and (potentially) workload configuration (excl. secrets) - think browsing open-source code
  2. grant public read-only access to all platform- and (potentially) workload operational data (logs, metrics, issues) without the need to scrub the data - think browsing open-source code
  3. grant authenticated read-write access to the environment with the least legal hurdles - e.g. by accessing you're accepting a EULA implicitly - think deploying workloads

We don't want to constrain how the data is accessed. At most, people with access might implicitly accept an agreement; at best, we want to make the data available to the world with no authentication in place, so that even a web search crawler could access it.
The technical implementation is straightforward, but we have to stay within legal requirements, like GDPR.

The legal question would be: what agreements/licenses/etc. do we need to put in place, and what is the nature of the data we're allowed to publish?

The definition of operational data is tracked in #80

Separate Features into "Persona Directories"

We can start with "Operations" and "Data scientist" personas directories. The purpose of these directories will be to store the feature files that belong to each persona accordingly.

CRC vs Quicklab deployments

I'm seeing little to no difference between crc/quicklab environments, and at this point I wonder if it's worth constantly adding a new directory for both when most of the time they end up being the same (referencing base). Would it not be better to just have a single "dev" folder instead? This would reduce a lot of unnecessary clutter.

If there is a significant difference between the two (in terms of how/what we deploy), feel free to mention it here -- this is simply what I've noticed πŸ˜ƒ

Re-keying secrets if a private key gets lost

Let's discuss how we plan to handle losing the Operate First key, or any key used to encrypt OPF resources.

  • We need a plan for what to do if the master key gets exposed.
  • What should we do if a sub-key (like a team key encrypting only a subset of the resources) is exposed?
  • How do we handle git history, with all the forks and everything, if we need to rotate keys? Files encrypted with the old key are still retained in the history.
  • How do we mass-add new keys and re-encrypt all the current resources with a new key?
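Assuming the secrets are encrypted with sops (illustrative; the actual tooling may differ), the mass-rotation question maps to editing the repo-level key policy and re-encrypting. A sketch with placeholder fingerprints:

```yaml
# .sops.yaml sketch — fingerprints are placeholders. After swapping the
# compromised key for a new one here, running `sops updatekeys <file>` on
# each encrypted file re-encrypts it without the old key.
creation_rules:
  - path_regex: .*\.enc\.yaml$
    pgp: >-
      NEW_MASTER_KEY_FPR,TEAM_SUBKEY_FPR
```

This handles current files; the git-history question above remains, since old ciphertexts stay recoverable by anyone holding the leaked key.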

Write ADR for Log Forwarding pipeline

The work currently being done on adding long term storage to logs, and forwarding logs from CLO to loki (via CLF, Kafka, and vector) is something we should catalogue as a decision record.
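To anchor the ADR, a sketch of the pipeline's first hop, assuming the Cluster Log Forwarder API; the broker address and topic are illustrative:

```yaml
# Illustrative sketch: CLO's ClusterLogForwarder shipping application logs
# to a Kafka topic (from which vector/loki pick them up downstream).
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: kafka-app-logs
      type: kafka
      url: tls://kafka.example.svc:9093/app-logs   # broker + topic, placeholder
  pipelines:
    - name: forward-app-logs
      inputRefs: [application]
      outputRefs: [kafka-app-logs]
```

The ADR should record why Kafka sits between CLF and Loki rather than forwarding directly.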

ADR for deploying operators

Define what goes into cluster-scope and what does not. Include different viewpoints:

  • Permissions to deploy certain resource types (Subscription and OperatorGroup)
  • OperatorGroup namespace scope - cluster-wide vs. single namespace
  • meta-operator CR permissions
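The resource types in question, for concreteness. A sketch of a namespaced operator install; all names are illustrative, and omitting `targetNamespaces` is what makes an OperatorGroup cluster-wide:

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: example-og
  namespace: example-ns
spec:
  targetNamespaces:
    - example-ns        # omit this list entirely for cluster-wide scope
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: example-ns
spec:
  channel: stable
  name: example-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
```

The ADR then decides who may merge each of these kinds, and into which directories.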

ADR - Resource Quotas Enforcement

With the introduction of tiered quotas in the apps repo: operate-first/apps#439
We are ready to begin enforcing quotas on namespaces. We should first have an ADR describing the strategy we would like to employ and the details of how it will work.
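For reference, a tier in this scheme would presumably boil down to a ResourceQuota like the following; the tier name and numbers are illustrative, not the values from the linked PR:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: small-tier          # hypothetical tier name
  namespace: example-project
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
```

The ADR should cover how namespaces are assigned a tier and how upgrades between tiers are requested.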

ADR for Data Collection

The Operate First environments will create a vast amount of operational data from platform systems and user workloads.
We want to publish the data under a license agreement that is similar to an open source license agreement.
We still have to operate in the boundaries of law and therefore cannot publish data that would break the law.

We need an ADR for different options on how to satisfy this requirement.

CPU pressure - Shared resources guide

Create a guide on how to be respectful and mindful of shared resources and on what CPU request and CPU limit mean. We're in a constant state of CPU pressure, while utilization is never above 20%.

  • Explain to users what the CPU request and the CPU limit mean.
  • Explain to users that setting a CPU request above 1 makes no difference if their application is not multi-core.
  • Create a dashboard where they can check their Kubeflow pipeline/JupyterHub pod's actual usage compared to its CPU request
  • Create an alert for when the CPU request is far above the actual usage
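The request/limit distinction the guide needs to explain fits in one snippet (values illustrative):

```yaml
# Container resources sketch — values are examples, not recommendations.
resources:
  requests:
    cpu: 500m    # what the scheduler reserves; counts against cluster
                 # capacity even if the pod never actually uses it
  limits:
    cpu: "1"     # hard ceiling; the kernel throttles the container above this
```

Over-sized requests are exactly why the cluster reports CPU pressure at 20% utilization: reserved-but-idle CPU is unavailable to everyone else.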

ADR required for monitoring setup

We have made some progress with setting up monitoring, and have some issues already created in the appropriate repos for creating service monitors, prometheus / grafana deployments etc. What's lacking is a document that adds context to our setup and future plans. For this we should prepare an ADR that outlines our monitoring architecture.

Create a component overview of 'the Op1st company'

As an outside spectator,
I want to have a look at the website of operate first
so that I can learn which components are deployed and maintained by op1st

As an example, we could have a look at the Fedora Infrastructure wiki
