

CNCF TAG Observability 🔭⚙️

Technical Advisory Group for Observability under the umbrella of CNCF.

Mission statement

TAG Observability focuses on topics pertaining to the observation of cloud native workloads. Additionally, it produces supporting material and best practices for end-users and provides guidance and coordination for CNCF projects working within the TAG’s scope.

Whitepaper

See our latest 1.0 observability whitepaper to kickstart your observability journey or enhance what you know so far! 💪🏽

Contributions are welcome to evolve it further.

Scope

Excerpts from the Observability Charter Document:

  • Foster, review, and grow the ecosystem of observability-related projects, users, and maintainers in open source, cloud-native technology.
  • Identify and report gaps in the CNCF's project portfolio on topics of observability to the TOC and the wider CNCF community.
  • Collect, curate, champion, and disseminate patterns and current best practices related to the observation of cloud-native systems that are effective and actionable.
  • Educate and inform users with unbiased, accurate, and pertinent information. Educate and help other CNCF projects regarding observability techniques and best current practices available within the CNCF.
  • Provide and maintain a vendor-neutral venue for relevant thought validation, discussion, and project feedback.
  • Provide a ladder for community members to become involved with the technical oversight of projects within the TAG's scope in an open, transparent, and inclusive way.

How we communicate

CNCF projects related to the TAG


Interactive Landscape

How to get involved

There are many ways to participate in the Observability TAG's activities.

Great ways to get involved include:

If you would like to suggest a specific topic or action item, please determine if there are ongoing activities or prior art. Good starting points are our GitHub Issues, reports, or meeting notes.

If you want to propose new TAG activities or join in for existing ones, please take a look at our Kanban Board or file a suggestion with a GitHub issue :-)

Operations

TOC Liaisons

Name          | Email             | GitHub         | Company
Cathy Zhang   | [email protected] | cathyhongzhang | Intel
Erin Boyd     | [email protected] | erinaboyd      | Red Hat
Ricardo Rocha | [email protected] | rochaporto     | CERN

Chairs (alphabetical order)

Name             | Email             | CNCF Slack      | GitHub      | Company | Open Source
Alolita Sharma   | [email protected] | @Alolita Sharma | alolita     | Apple   | OpenTelemetry Team
Matt Young       | [email protected] | @Matt Young     | halcyondude | Apple   |
Richard Hartmann | [email protected] | @RichiH         | RichiH      | Grafana | Prometheus Team; PromCon Lead

Tech Leads

Name              | Email             | CNCF Slack | GitHub   | Company | Open Source
Bartłomiej Płotka | [email protected] | @bwplotka  | bwplotka | Google  | Prometheus Team; Thanos Team; Other

Governance

This TAG follows the standard operating model provided by the TOC.

Code of Conduct

We follow the CNCF's Code of Conduct.


Contributors

alolita, amye, arthursens, attachmentgenie, brettbuddin, bwplotka, cjyabraham, dofinn, dominick-blue, eminetto, eran-medan, fribeiro1, halcyondude, heidmotron, henrikrexed, ianonavy, kenfinnigan, manolama, olli73773, pitr, prashant3863, programmer04, rexagod, riaankl, richih, robertkielty, sinisterlight, skolodyazhnyy, vjsamuel, yang-db


tag-observability's Issues

Form Program: Annual Sandbox Review

Annual Review for Sandbox projects

Context

Tasks

  • Create a program to facilitate reporting and tracking reports, and to make them easy to find.
  • Establish tooling/process such that:
    • self-serve semantics are employed: all data --> GitHub, managed via PR.
    • the outcome is a tool useful for all TAGs (not just Observability). This could include issue templates and a timeline representation.
    • all metadata resides in GitHub as JSON (preferred) or CSV. GitHub Flat actions can help here, as does the Flat Viewer (flatgithub.com). This also allows reconstruction later, or pulling the actual documents into graph.cncf.io (with Lucene full-text indexing).
  • Solicit feedback from TOC and @amye.

Outcome

  • Automated, useful reports are in place and discoverable.
  • Coordinate with the Landscape Graph project (graph.cncf.io)
  • Present to #tag-chairs, #tag-contributor-strategy, and #toc with learnings and a proposal for all TAGs.
    • Webinar describing the process, how to engage, and why one would.
    • Pitch deck (1-3 slides) that builds understanding.

Draft on observability topics/aspects for a white paper

We started a discussion about a structure for a document (without knowing that issues #16 and #19 had similar proposals) that outlines some aspects and challenges of observability in cloud native, such as:

  • A more general definition of observability, e.g., the three pillars, how to use/apply it, and how to onboard.
  • Which projects exist in the CNCF and with which gaps, e.g., landscape and technology radar.
  • Migration channels with best practices, e.g., production experiences migrating solutions.
  • Issues with old patterns.

Question: are there more topics to discuss?

There are many similar questions: how to select among multiple storage systems, whether the existence of an all-in-one query language is reasonable, how to choose and implement tail-based sampling on the collector side given the trade-off between distributed strong consistency and high availability, and so on. (e.g. there)

Maybe we should have a curated list to record these questions, which I personally find very valuable and meaningful. 😀

WG: State of Observability (survey)

The goal of this working group is to take an open, community-driven, collaborative approach to a survey that builds a broader polling of users of open source observability solutions. This includes projects in the CNCF and projects outside of that community.

Non-goals include deciding exactly how we distribute the survey and which tools we use for data collection. It could be anything from a simple Google Form to a service like SurveyMonkey.


https://docs.google.com/document/d/1YPo6fJVTJ8pgqGPk7zdTA6aOtK5oNLg77ys8qdEsxMQ/edit#heading=h.wl6eks62d18g

[Observability Whitepaper] References missing and possibly wrongly linked

For example

"In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs" [9]

Reference 9 is not present.

It should also be double-checked that the references are actually linked correctly.

Community research[4] conducted by ClearPath Strategies and Honeycomb.io

4 is:

BEYER, Betsy; MURPHY, Niall; RENSIN, David; KAWAHARA, Kent; THORNE, Stephen. The Site Reliability Workbook. O'Reilly Media, 2018. Available at: https://sre.google/workbook/table-of-contents/. Accessed on: June 24, 2021.

I don't think (?) that this is the research by ClearPath and Honeycomb.

Idea: CNCF Container registry observability for project maintainers

Suggested at our TAG meeting: CNCF projects currently can see nothing more than total pull counts from today's main registries. Things like tags, runtimes, uniques, etc. could be useful to project maintainers but are currently not exposed. It would be great to learn more about what kinds of distribution metrics would be helpful to CNCF maintainers, to get a better sense of how we can best work towards more observable open source container distribution.

Whitepaper on cloud-native observability

The goal is to support users in implementing observability and monitoring for their cloud native workloads.

Target: End users building cloud native applications

Scope: Define basic concepts of data collection and analysis and how CNCF projects can be used for them. Maybe add 1-3 real-world reference examples.

Details:

  • Data collection: logs, metrics, traces
    • Which data source to use for what
    • Examples with CNCF projects: Prometheus, Jaeger, OpenTelemetry, ...
  • Make your Kubernetes cluster - and the apps running on it - observable
  • Storage backends for the data
    • Options based on CNCF projects
    • Enterprise requirements: HA, RBAC, ...
  • Data analysis patterns:
    • Alerting, anomaly detection, trace analytics, log analytics

s/sig/tag for charter

Context: cncf/toc#654

Links are broken in the charter due to the rename of SIGs to TAGs. Also, the default branch in the TOC repo was changed from master to main.

FWIW @amye already made these changes in the TOC repo; they would just need to be back-ported here.

Happy to take a first stab if that's welcome! I understand if you don't want randos submitting PRs to your core document 😉

Guidelines for developers on how to implement new metrics

Hello everyone! Currently, I'm a Trainee under the CNCF's Community Bridge program, working for the KubeVirt community.

My project is to improve KubeVirt's observability by implementing new metrics that represent what is going on with the KubeVirt environment. And after I got a little more comfortable with Open Source development, I also started to look for other projects that need help with observability and found Flux.

I've noticed that the first step before doing anything in open source is to write a proposal so other people can comment and suggest changes based on best practices for what is about to be implemented. However, when discussing new metrics, we can't seem to reach agreement on some points, mostly because we cannot find curated guidelines for the issues I will present below.

This might be a really extensive issue, but I'd like to provide as much information as possible to present my thoughts 🤔
Hopefully, I won't bore you guys too much 😬

Metrics granularity

Let me give an example. Flux can synchronize Kubernetes manifests in a git repository with a Kubernetes cluster. To do so, when configuring Flux, one needs to specify which git remote repository, which branch, and which directory paths Flux will watch when doing that synchronization.

Flux has a metric called flux_daemon_sync_manifests with a label success=true|false, which indicates the number of manifests being synchronized between repository and cluster and whether the sync succeeded.

When working with a single git repository, this may be enough, but let's say that someone is using Flux to sync multiple repositories with a multi-cluster/multi-cloud environment. It will be hard to pinpoint exactly where the problem is when receiving an alert for flux_daemon_sync_manifests{success="false"} > 1.

To solve this, there are a couple of approaches:

  • Tell users to use different Prometheus/Alertmanager servers for each Flux deployment
  • Add new labels to flux metrics that will help to pinpoint problems and set up better alerts.

I think everyone agrees that we should stick with the latter. But how much information is too much?

The most obvious labels are git_repository, git_branch and path, but we could also add some not-so-obvious ones like manifest, which would tell exactly which file failed/succeeded the synchronization. We could also split the metric into flux_daemon_sync_manifests_fails and flux_daemon_sync_manifests_success and add an error label to the first one indicating what error was encountered during the sync.

Of course, adding new labels always helps us build more detailed monitoring solutions, but they will also need more storage capacity and may cause performance issues with PromQL (not sure about the last one).

I'm almost sure that a user can drop unnecessary labels from metrics at scrape time, but I don't think that should justify developers adding every label that could be useful to every single use case.
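
To make this concrete, here is a minimal client_golang sketch of the label-based approach. The metric name mirrors flux_daemon_sync_manifests from above, but the extra labels and the recordSync helper are hypothetical, illustrating the proposal rather than Flux's actual code:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical: the Flux metric name with the proposed extra labels.
// Every distinct label combination becomes its own time series, which
// is exactly where the storage/cardinality cost comes from.
var syncedManifests = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "flux_daemon_sync_manifests",
		Help: "Manifests synchronized between repository and cluster.",
	},
	[]string{"success", "git_repository", "git_branch", "path"},
)

// recordSync registers one sync attempt (hypothetical helper).
func recordSync(repo, branch, path string, ok bool) {
	success := "false"
	if ok {
		success = "true"
	}
	syncedManifests.WithLabelValues(success, repo, branch, path).Inc()
}

func main() {
	recordSync("git@example.com:org/infra", "main", "clusters/prod", true)
	// Expose the metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With the extra labels in place, an alert can be scoped, e.g. flux_daemon_sync_manifests{success="false", git_repository="git@example.com:org/infra"} > 1, at the cost of one series per repository/branch/path combination.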

Differentiate between what should be a metric or a label

I could use the same example as above. Should flux_daemon_sync_manifests be a single metric with the success label? Or should it be split in two: flux_daemon_sync_manifests_fails and flux_daemon_sync_manifests_success?

But let me give you another example:

KubeVirt is capable of deploying virtual machines on top of a Kubernetes cluster, and it exposes metrics regarding the VMs' performance and resource usage.

Looking at node exporter's approach to disk metrics, it exposes one metric for read operations and another for write operations. For example:

  • node_disk_ops_reads - the total number of read operations from a disk device
  • node_disk_ops_written - the total number of write operations on a disk device

KubeVirt's approach, on the other hand, is to expose a single disk metric, but with the label type=read|write.

  • kubevirt_vmi_storage_iops_total - the total number of operations on a disk device, with a label to differentiate read and write ops

Both approaches work just fine, but which one is better? Do they have any differences performance-wise?

Whenever a developer knows that a given label's value will ALWAYS be within a pre-defined set of values, the developer can choose whether to implement several metrics or just a single one with an extra identifying label.
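
As a sketch of both shapes with client_golang (the declarations are illustrative, not the projects' actual code): storage-wise the two are equivalent here, since each produces two time series, but the labeled form lets PromQL aggregate the distinction away with sum without(type) (kubevirt_vmi_storage_iops_total).

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Shape A (node exporter style): one metric per operation kind.
var (
	diskReads = promauto.NewCounter(prometheus.CounterOpts{
		Name: "node_disk_ops_reads",
		Help: "Total read operations from the disk device.",
	})
	diskWrites = promauto.NewCounter(prometheus.CounterOpts{
		Name: "node_disk_ops_written",
		Help: "Total write operations on the disk device.",
	})
)

// Shape B (KubeVirt style): one metric with a bounded `type` label.
// This only stays cheap because type has exactly two possible values.
var diskOps = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "kubevirt_vmi_storage_iops_total",
	Help: "Total disk operations, labeled by operation type.",
}, []string{"type"})

// OnRead and OnWrite record one operation in both shapes, purely to
// show that the call sites look the same either way.
func OnRead()  { diskReads.Inc(); diskOps.WithLabelValues("read").Inc() }
func OnWrite() { diskWrites.Inc(); diskOps.WithLabelValues("write").Inc() }
```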

How to work with historical data

To better explain this one, I'll show you the problem I'm facing with KubeVirt.

As previously said, KubeVirt deploys VMs on top of Kubernetes. KubeVirt can also migrate VMs between nodes, which is necessary when a node becomes slow or unresponsive. Virtual Machine Instances (VMIs) and Virtual Machine Instance Migrations (VMIMs) are implemented as Kubernetes Custom Resources.

A VMIM holds some useful information, like start and end timestamps, the target node the VMI is being migrated to, the source node it is being migrated from, and which migration method is being used; migrations can be in different stages: Succeeded, Running, and Failed.

Every VMI and VMIM that is posted to the K8s API is stored in etcd, and as long as the VMI still exists in the cluster, we can retrieve its data and expose it easily. Once we delete a VMI, the VMI and all VMIMs related to it are deleted from etcd, and then we can't expose information about them anymore.

However, users want to analyze and correlate problems from previous VMI migrations with the existing ones, so they can identify why some migrations are taking more time than others and why they fail or succeed.

Let me try to explain it like this:

  • Each row is a time-series for a VMI migration metric
  • Each column represents 1h in the timeline
  • Let's assume that the Prometheus server was configured to keep metrics in HEAD for 3h before indexing them in the TSDB
  • o represents that the metric was collected in that particular moment of time
  • - represents that the metric was not collected in that particular moment of time
  • [ ] represents where the Prometheus HEAD is pointed to for a particular time series

We can either:

  • Keep VMIM objects in etcd and always expose old migration metrics
    • Too much disk space/memory required for both metrics storage and etcd
    • Will always keep every migration in-memory and thus easy to analyze.
/\
| o  o  o  o  o  o  o [o]    #migration 1 UID=1
| o  o  o  o  o  o  o [o]    #migration 2 UID=2
| o  o  o  o  o  o  o [o]    #migration 3 UID=3 (last one for that particular VMI)
------------------------->

Or

  • Remove migrations information of old VMIs from etcd
    • Less storage capacity needed
    • Will have to deal with historical data on the Prometheus side
/\
| o  o  o  -  -  -  -  -     #migration 1 UID=1
| -  -  -  o  o [o] -  -     #migration 2 UID=2
| -  -  -  -  -  -  o [o]    #migration 3 UID=3 (last one for that particular VMI)
------------------------->

Let's say that I want to create a dashboard with information about all migrations that have ever happened within my cluster. With the first approach, a simple query like this would be enough: kubevirt_vmi_migration_metric_example, since everything is in memory.

Once a time series is removed from Prometheus' HEAD, I will have to work with queries over time ranges, most probably against remote storage like Thanos or InfluxDB as well. The queries will no longer return single metric values but rather metric vectors, which need to be treated differently. It's not impossible, but it surely must be thought through carefully.
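
To sketch what that second mode looks like, here is a rough example using the Prometheus Go client's HTTP API to run a range query; the server address is a placeholder, and the metric name is the example one from above:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address: point this at Prometheus, Thanos, or any
	// other backend speaking the Prometheus HTTP API.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	// A range query returns a matrix (one vector of samples per
	// series), not instant values, so the dashboard code has to
	// iterate over series and samples instead of reading one number.
	r := v1.Range{
		Start: time.Now().Add(-24 * time.Hour),
		End:   time.Now(),
		Step:  time.Minute,
	}
	result, warnings, err := promAPI.QueryRange(context.Background(),
		"kubevirt_vmi_migration_metric_example", r)
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```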


I'm sure there are good solutions for everything I'm raising in this issue, and I'm also sure there are several other problems that I couldn't think of right now. What I'm asking for is a centralized place with documentation and guidelines for developers who are trying to improve their applications' observability.

Perhaps it could be study material for a future Monitoring and Observability Certification, by CNCF. 👀

But anyway, this would greatly help anyone who is writing proposals for open source projects or anyone trying to follow CNCF's guidelines for Cloud Native Observability.

Whitepaper: Add section about instrumentation

It would be epic to clarify what manual vs. auto instrumentation are. This also means explaining the Prometheus exporter pattern, mentioning semi-manual instrumentation (libraries auto-instrumenting functionality), etc.
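
For instance, a minimal Go sketch could contrast the two; the metric names here are made up, and promhttp.InstrumentHandlerCounter stands in for "a library auto-instrumenting functionality":

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Manual: the developer declares the metric and records it by hand.
var greetings = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_greetings_total",
	Help: "Greetings served.",
})

func greet(w http.ResponseWriter, r *http.Request) {
	greetings.Inc() // explicit instrumentation point
	w.Write([]byte("hello"))
}

func main() {
	// Semi-manual: a middleware provided by the library counts
	// requests (by status code and method) without per-call-site code.
	requests := promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "HTTP requests, by code and method.",
	}, []string{"code", "method"})
	http.Handle("/", promhttp.InstrumentHandlerCounter(requests, http.HandlerFunc(greet)))

	// Exporter pattern: the process exposes /metrics for Prometheus
	// to scrape over HTTP.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```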

Draft proposal for initial set of Working Groups

(from the TOC TAG document)

TAGs may choose to spawn focused and time-limited working groups to achieve some of their responsibilities (for example, to produce a specific educational white paper, or a portfolio gap analysis report). Working groups should have a clearly documented charter, timeline (typically a few quarters at most), and set of deliverables. Once the timeline has elapsed, or the deliverables have been delivered, the working group dissolves, or is explicitly re-chartered.

We need a clear and approachable process for community members (and those that are considering joining) to propose and potentially form a new Working Group.

To jumpstart and facilitate this:

  • Identify and enumerate potential WGs (the charter has a bunch to start with)
  • Propose a simple, clear process we'll use to evaluate and create new WGs
  • Create an artifact (.md in this repo) capturing this that can be shared with the community

Create presentation slide deck (1-3 slides)

This will be a 1-3 slide presentation representative of the initial thoughts and focus of this project. It will serve as a basis to solicit feedback from TAG-Chairs, TOC, and the TAG-Contributor-Strategy team, as well as all attendees of the presentation meeting.

WG: Foster and encourage in-person observability-focused meetups.

  • Brainstorm: What makes a fun meetup? What can we do to help?
  • Create materials (slides, tips, templates) to lower the barrier to forming these groups
  • Create a registry/list of meetups
  • Propose an actionable plan to jumpstart a few, and solicit TAG member volunteers

TAG Observability Logo

We need a logo! Folks at the TOC will complete a number of workflow steps once we have our logo created and in place.

  • Choose a logo
  • Update readme, landscape graphic, and other places with official logo
  • Write up "why owls"

WG: Observability Personas

Proposed personas to get us started:

  • App / service developers
  • Operators
  • Vendors
  • Projects
  • Cloud Providers
  • Projects & Vendors not yet in CNCF
  • TOC

For each, identify some number of specific examples and reach out to them. Listen to their challenges and wins. Our personas should be derived from primary sources, should be vetted, and should describe their needs, challenges, and expectations for their engagement with us.

Contribute to adding profiling as OTEL supported event type

As profiling has grown in popularity and more companies have started to build tooling around profiling, there has been a lot of discussion about adding profiling as an official OTEL supported event type.

Logs, metrics, and traces have already gone through a similar process to become more standardized, and seeing as we have identified profiling as an "increasingly important" observability signal in our whitepaper, it would be great for the Observability TAG to help provide input/resources towards this effort.

Several interested groups from a profiling-developer Slack channel, the OTEL issue linked above, and some who discussed this live at KubeCon EU this year have started to plan for making progress on this, and I've compiled some general thoughts about possible next steps in a doc here.

I'd love to discuss this as a group at the next meeting and also to gather thoughts, ideas, opinions, etc here as well!

cncf/landscape graph - kickoff!


Welcome!

This issue is to provide a way for members of the TAG Observability community to express interest in contributing to the https://github.com/cncf/landscape-graph project.

ACTION: Please leave a comment or the emoticon of your choice on this issue.

We'll target our project kick-off meeting 2/27 - 3/3 (doodle will be distributed, meeting(s) recorded).


The Landscape Graph project is looking for YOU! We are seeking folks who are interested in contributing to the project. We welcome anyone who wants to collaborate in the open. All of the following disciplines might find interesting challenges and the potential for novel work.

  • Project & Program Managers
  • Data Scientists, with a focus on Graph Data Science, AdTech, FinTech, and social networks.
  • UX / Design professionals
  • Financial Analysts
  • Developers (front end, back end, UI, k8s)
    • (any of, not all) Typescript, Python, Bash, Java, Cypher, SQL, GraphQL, ...
    • Kubernetes, Cloud Infrastructure, ...
  • Creative Content (technical writing, art, design)
  • Anyone with a passion for Open Source ecosystems, communities, and assessments of project health and composition.
  • Those wishing to work in a tech stack that's wildly useful for Observability Solutions.

Details

Why does this project exist?

Graphs can facilitate rich analysis of our vibrant and dynamic communities, the humans they comprise, and the clusters of contribution and thought leadership they produce.

Often, we need to understand how an open source project interacts with others, how it's changing over time, and who's enabling its continued success. We want to understand what alternatives exist, or how complementary projects might be combined in purpose-fit or novel ways. We might want to dive in and contribute! This is how projects and ecosystems grow to meet the business challenges facing modern organizations.

What's the overall approach at the outset of the project?

Using the data underlying the existing landscape as input, a Labeled Property Graph (LPG) is constructed using Cypher (SQL for graphs), resulting in a knowledge graph.

Landscape Graph Core Data Model (initial)


WG Content

Let's create one; we will see over time who is interested in doing what.

Some common roles (not a prescription of what needs to be done or how):

  • YT/Twitter/shepherd
  • Content creation
  • Content pipeline (new speakers, etc)

Observability knowledge entrypoint/index page.

As per today's CNCF SIG meeting, it would be awesome if we could provide an entry point to the observability projects we represent.

Overall, the goal would be to improve the experience for users, especially new joiners. See #18 for example questions new users have when entering the observability world, this time from the metrics angle.

Acceptance Criteria:

  • A friendly starting point for how to begin learning about observability for cloud-native apps
  • Instead of repeating content, link to projects' docs (some nice search as well?)
  • Establish a feedback channel with observability projects in case of missing docs/answers
