

CNCF TAG Observability 🔭⚙️

Technical Advisory Group for Observability under the umbrella of CNCF.

Mission statement

TAG Observability focuses on topics pertaining to the observation of cloud native workloads. Additionally, it produces supporting material and best practices for end-users and provides guidance and coordination for CNCF projects working within the TAG’s scope.

Whitepaper

See our latest 1.0 observability whitepaper to kickstart your observability journey or enhance what you know so far! 💪🏽

Contributions are welcome to evolve it further.

Scope

Excerpts from the Observability Charter Document:

  • Foster, review, and grow the ecosystem of observability-related projects, users, and maintainers in open source, cloud-native technology.
  • Identify and report gaps in the CNCF's project portfolio on topics of observability to the TOC and the wider CNCF community.
  • Collect, curate, champion, and disseminate patterns and current best practices related to the observation of cloud-native systems that are effective and actionable.
  • Educate and inform users with unbiased, accurate, and pertinent information. Educate and help other CNCF projects regarding observability techniques and best current practices available within the CNCF.
  • Provide and maintain a vendor-neutral venue for relevant thought validation, discussion, and project feedback.
  • Provide a ladder for community members to become involved with the technical oversight of projects within the TAG's scope in an open, transparent, and inclusive way.

How we communicate

CNCF projects related to the TAG


Interactive Landscape

How to get involved

There are many ways to participate in the Observability TAG's activities.

Great ways to get involved include:

If you would like to suggest a specific topic or action item, please determine if there are ongoing activities or prior art. Good starting points are our GitHub Issues, reports, or meeting notes.

If you want to propose new TAG activities or join in for existing ones, please take a look at our Kanban Board or file a suggestion with a GitHub issue :-)

Operations

TOC Liaisons

Name          | Email             | GitHub         | Company
Cathy Zhang   | [email protected] | cathyhongzhang | Intel
Erin Boyd     | [email protected] | erinaboyd      | Red Hat
Ricardo Rocha | [email protected] | rochaporto     | CERN

Chairs (alphabetical order)

Name             | Email             | CNCF Slack      | GitHub      | Company | Open Source
Alolita Sharma   | [email protected] | @Alolita Sharma | alolita     | Apple   | OpenTelemetry Team
Matt Young       | [email protected] | @Matt Young     | halcyondude | Apple   |
Richard Hartmann | [email protected] | @RichiH         | RichiH      | Grafana | Prometheus Team; PromCon Lead

Tech Leads

Name              | Email             | CNCF Slack | GitHub   | Company | Open Source
Bartłomiej Płotka | [email protected] | @bwplotka  | bwplotka | Google  | Prometheus Team; Thanos Team; Other

Governance

This TAG follows the standard operating model provided by the TOC.

Code of Conduct

We follow the CNCF's Code of Conduct.


Contributors

alolita, amye, arthursens, attachmentgenie, brettbuddin, bwplotka, cjyabraham, dofinn, dominick-blue, eminetto, eran-medan, fribeiro1, halcyondude, heidmotron, henrikrexed, ianonavy, kenfinnigan, manolama, olli73773, pitr, prashant3863, programmer04, rexagod, riaankl, richih, robertkielty, sinisterlight, skolodyazhnyy, vjsamuel, yang-db


tag-observability's Issues

Form Program: Annual Sandbox Review

Annual Review for Sandbox projects

Context

Tasks

  • Create a program to facilitate reporting and tracking reports, and to make them easy to find.
  • Establish tooling/process such that:
    • self-serve semantics are employed: all data --> GitHub, managed via PR.
    • the outcome is a tool useful for all TAGs (not just Observability). This could include issue templates and a timeline representation.
    • all metadata resides in GitHub as JSON (preferred) or CSV. GitHub Flat actions can help here, as does the Flat Viewer (flatgithub.com). This also allows reconstruction later, or pulling the actual documents into graph.cncf.io (with Lucene full-text indexing).
  • Solicit feedback from TOC and @amye.

Outcome

  • Automated, useful reports are in place and discoverable.
  • Coordinate with the Landscape Graph project (graph.cncf.io)
  • Present to #tag-chairs, #tag-contributor-strategy, and #toc with learnings and a proposal for all TAGs.
    • Webinar describing the process, how to engage, and why one would.
    • Pitch deck (1-3 slides) that builds understanding.

Draft on observability topics/aspects for a white paper

We started a discussion about a structure for a document (without knowing that issues #16 and #19 had similar proposals) that outlines some aspects and challenges of observability in cloud native, such as:

  • A more general definition of observability, e.g., the three pillars, how to use/apply it, and how to onboard.
  • Which projects exist in the CNCF and with which gaps, e.g., landscape and technology radar.
  • Migration channels with best practices, e.g., production experiences migrating solutions.
  • Issues with old patterns.

Question: are there more topics to discuss?

There are many similar questions: how to select among multiple storage systems, whether the existence of an all-in-one query language is reasonable, how to choose and implement tail-based sampling on the collector side given the trade-off between distributed strong consistency and high availability, and so on. (e.g. there)

Maybe we should have a curated list to record these questions, which I personally find very valuable and meaningful. 😀

WG: State of Observability (survey)

The goal of this working group is to take an open, community-driven, collaborative approach to a survey that builds a broader polling of users of open source observability solutions. This includes projects in the CNCF and projects outside of that community.

Non-goals include deciding exactly how we distribute the survey and which tools we use for data collection. It could be anything from a simple Google Form to a service like SurveyMonkey.


https://docs.google.com/document/d/1YPo6fJVTJ8pgqGPk7zdTA6aOtK5oNLg77ys8qdEsxMQ/edit#heading=h.wl6eks62d18g

[Observability Whitepaper] References missing and possibly wrongly linked

For example

"In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs" [9]

Reference 9 is not present.

It should also be double-checked that the references are actually linked correctly.

Community research[4] conducted by ClearPath Strategies and Honeycomb.io

4 is:

BEYER, Betsy; MURPHY, Niall; RENSIN, David; KAWAHARA, Kent; THORNE, Stephen. The Site Reliability Workbook. O'Reilly Media, 2018. Available at: https://sre.google/workbook/table-of-contents/. Accessed on: June 24, 2021.

I don't think (?) that this is the research by ClearPath and Honeycomb.

Idea: CNCF Container registry observability for project maintainers

Suggested at our TAG meeting: CNCF projects currently can see nothing more than total pull counts from today's main registries. Things like tags, runtimes, uniques, etc. could be useful to project maintainers but are currently not exposed. It would be great to learn more about what kinds of distribution metrics would be helpful to CNCF maintainers, to get a better sense of how we can best work towards more observable open source container distribution.

Whitepaper on cloud-native observability

The goal is to support users in implementing observability and monitoring for their cloud native workloads.

Target: End users building cloud native applications

Scope: Define basic concepts of data collection and analysis and how CNCF projects can be used for them. Maybe add 1-3 real-world reference examples.

Details:

  • Data collection: logs, metrics, traces
    • Which data source to use for what
    • Examples with CNCF projects: Prometheus, Jaeger, OpenTelemetry, ...
  • Make your Kubernetes cluster - and the apps running on it - observable
  • Storage backends for the data
    • Options based on CNCF projects
    • Enterprise requirements: HA, RBAC, ...
  • Data analysis patterns:
    • Alerting, anomaly detection, trace analytics, log analytics

s/sig/tag for charter

Context: cncf/toc#654

Links are broken in the charter due to the rename of SIGs to TAGs. Also, the default branch in the TOC repo was changed from master to main.

FWIW @amye already made these changes in the TOC repo; they would just need to be back-ported here.

Happy to take a first stab if that's welcome! I understand if you don't want randos submitting PRs to your core document 😉

Guidelines for developers on how to implement new metrics

Hello everyone! Currently, I'm a Trainee under the CNCF's Community Bridge program, working for the KubeVirt community.

My project is to improve KubeVirt's observability by implementing new metrics that represent what is going on with the KubeVirt environment. And after I got a little more comfortable with Open Source development, I also started to look for other projects that need help with observability and found Flux.

I've noticed that the first step before doing anything in open source is to write a proposal so other people can comment and suggest changes based on best practices for what is about to be implemented. However, when discussing new metrics, we can't seem to reach agreement on some points, mostly because we cannot find curated guidelines for the issues I will present below.

This might be a really extensive issue, but I'd like to provide as much information as possible to present my thoughts 🤔
Hopefully, I won't bore you guys too much 😬

Metrics granularity

Let me give an example. Flux can synchronize Kubernetes manifests in a git repository with a Kubernetes cluster. To do so, when configuring Flux, one needs to specify which git remote repository, which branch, and which directory paths Flux will watch when doing that synchronization.

Flux has a metric called flux_daemon_sync_manifests with a label success=true|false, which indicates the number of manifests being synchronized between repository and cluster and whether the sync succeeded.

When working with a single git repository, this may be enough, but let's say that someone is using Flux to sync multiple repositories with a multi-cluster/multi-cloud environment. It will be hard to pinpoint exactly where the problem is when receiving an alert for flux_daemon_sync_manifests{success="false"} > 1.

To solve this, there are a couple of approaches:

  • Tell users to use different Prometheus/Alertmanager servers for each Flux deployment
  • Add new labels to flux metrics that will help to pinpoint problems and set up better alerts.

I think everyone agrees that we should stick with the latter. But how much information is too much?

The most obvious labels are git_repository, git_branch and path, but we could also add some not-so-obvious ones like manifest, which would tell exactly which file failed/succeeded the synchronization. We could also split the metric into flux_daemon_sync_manifests_fails and flux_daemon_sync_manifests_success and add an error label to the first one indicating what error was encountered during the sync.

Of course, adding new labels always helps us build more detailed monitoring solutions, but they will also need more storage capacity and may cause performance issues with PromQL (not sure about the last one).

I'm almost sure that a user can drop unnecessary labels from metrics at scrape time, but I don't think that should justify developers adding every label that could be useful to every single use case.
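
To make this concrete, here is a minimal client_golang sketch of the label-based approach. The metric name mirrors flux_daemon_sync_manifests from above, but the extra labels and the recordSync helper are hypothetical, illustrating the proposal rather than Flux's actual code:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical: the Flux metric name with the proposed extra labels.
// Every distinct label combination becomes its own time series, which
// is exactly where the storage/cardinality cost comes from.
var syncedManifests = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "flux_daemon_sync_manifests",
		Help: "Manifests synchronized between repository and cluster.",
	},
	[]string{"success", "git_repository", "git_branch", "path"},
)

// recordSync registers one sync attempt (hypothetical helper).
func recordSync(repo, branch, path string, ok bool) {
	success := "false"
	if ok {
		success = "true"
	}
	syncedManifests.WithLabelValues(success, repo, branch, path).Inc()
}

func main() {
	recordSync("git@example.com:org/infra", "main", "clusters/prod", true)
	// Expose the metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With the extra labels in place, an alert can be scoped, e.g. flux_daemon_sync_manifests{success="false", git_repository="git@example.com:org/infra"} > 1, at the cost of one series per repository/branch/path combination.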

Differentiate between what should be a metric or a label

I could use the same example as above. Should flux_daemon_sync_manifests be a single metric with the success label? Or should it be split in two: flux_daemon_sync_manifests_fails and flux_daemon_sync_manifests_success?

But let me give you another example:

KubeVirt is capable of deploying virtual machines on top of a Kubernetes cluster, and it exposes metrics regarding the VMs' performance and resource usage.

Looking at node exporter's approach to disk metrics, it exposes one metric for read operations and another for write operations. For example:

  • node_disk_ops_reads - the total number of read operations from a disk device
  • node_disk_ops_written - the total number of write operations on a disk device

KubeVirt's approach, on the other hand, is to expose a single disk metric, but with the label type=read|write.

  • kubevirt_vmi_storage_iops_total - the total number of operations on a disk device, with a label to differentiate read and write ops

Both approaches work just fine, but which one is better? Do they have any differences performance-wise?

Whenever a developer knows that a given label's value will ALWAYS be within a pre-defined set of values, the developer can choose whether to implement several metrics or just a single one with an extra identifying label.
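
As a sketch of both shapes with client_golang (the declarations are illustrative, not the projects' actual code): storage-wise the two are equivalent here, since each produces two time series, but the labeled form lets PromQL aggregate the distinction away with sum without(type) (kubevirt_vmi_storage_iops_total).

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Shape A (node exporter style): one metric per operation kind.
var (
	diskReads = promauto.NewCounter(prometheus.CounterOpts{
		Name: "node_disk_ops_reads",
		Help: "Total read operations from the disk device.",
	})
	diskWrites = promauto.NewCounter(prometheus.CounterOpts{
		Name: "node_disk_ops_written",
		Help: "Total write operations on the disk device.",
	})
)

// Shape B (KubeVirt style): one metric with a bounded `type` label.
// This only stays cheap because type has exactly two possible values.
var diskOps = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "kubevirt_vmi_storage_iops_total",
	Help: "Total disk operations, labeled by operation type.",
}, []string{"type"})

// OnRead and OnWrite record one operation in both shapes, purely to
// show that the call sites look the same either way.
func OnRead()  { diskReads.Inc(); diskOps.WithLabelValues("read").Inc() }
func OnWrite() { diskWrites.Inc(); diskOps.WithLabelValues("write").Inc() }
```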

How to work with historical data

To better explain this one, I'll show you the problem I'm facing with KubeVirt.

As previously said, KubeVirt deploys VMs on top of Kubernetes. KubeVirt can also migrate VMs between nodes, which is necessary when a node becomes slow or unresponsive. Virtual Machine Instances (VMIs) and Virtual Machine Instance Migrations (VMIMs) are implemented as Kubernetes Custom Resources.

A VMIM holds some useful information, like start and end timestamps, the target node the VMI is being migrated to, the source node it is being migrated from, and which migration method is being used; migrations can be in different stages: Succeeded, Running, and Failed.

Every VMI and VMIM that is posted to the K8s API is stored in etcd, and as long as the VMI still exists in the cluster, we can retrieve its data and expose it easily. Once we delete a VMI, the VMI and all VMIMs related to it are deleted from etcd, and then we can't expose information about them anymore.

However, users want to analyze and correlate problems from previous VMI migrations with the existing ones, so they can identify why some migrations are taking more time than others and why they fail or succeed.

Let me try to explain it like this:

  • Each row is a time-series for a VMI migration metric
  • Each column represents 1h in the timeline
  • Let's assume that the Prometheus server was configured to keep metrics in HEAD for 3h before indexing them in the TSDB
  • o represents that the metric was collected in that particular moment of time
  • - represents that the metric was not collected in that particular moment of time
  • [ ] represents where the Prometheus HEAD is pointed to for a particular time series

We can either:

  • Keep VMIM objects in etcd and always expose old migration metrics
    • Too much disk space/memory required for both metrics storage and etcd
    • Will always keep every migration in-memory and thus easy to analyze.
/\
| o  o  o  o  o  o  o [o]    #migration 1 UID=1
| o  o  o  o  o  o  o [o]    #migration 2 UID=2
| o  o  o  o  o  o  o [o]    #migration 3 UID=3 (last one for that particular VMI)
------------------------->

Or

  • Remove migrations information of old VMIs from etcd
    • Less storage capacity needed
    • Will have to deal with historical data on the Prometheus side
/\
| o  o  o  -  -  -  -  -     #migration 1 UID=1
| -  -  -  o  o [o] -  -     #migration 2 UID=2
| -  -  -  -  -  -  o [o]    #migration 3 UID=3 (last one for that particular VMI)
------------------------->

Let's say that I want to create a dashboard with information about all migrations that have ever happened within my cluster. With the first approach, a simple query like this would be enough: kubevirt_vmi_migration_metric_example, since everything is in memory.

Once a time series is removed from Prometheus' HEAD, I will have to work with queries over time ranges, most probably against remote storage like Thanos or InfluxDB as well. The queries will no longer return single metric values but rather metric vectors, which need to be treated differently. It's not impossible, but it surely must be thought through carefully.
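
To sketch what that second mode looks like, here is a rough example using the Prometheus Go client's HTTP API to run a range query; the server address is a placeholder, and the metric name is the example one from above:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address: point this at Prometheus, Thanos, or any
	// other backend speaking the Prometheus HTTP API.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	// A range query returns a matrix (one vector of samples per
	// series), not instant values, so the dashboard code has to
	// iterate over series and samples instead of reading one number.
	r := v1.Range{
		Start: time.Now().Add(-24 * time.Hour),
		End:   time.Now(),
		Step:  time.Minute,
	}
	result, warnings, err := promAPI.QueryRange(context.Background(),
		"kubevirt_vmi_migration_metric_example", r)
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```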


I'm sure there are good solutions for everything I'm raising in this issue, and I'm also sure there are several other problems that I couldn't think of right now. What I'm asking for is a centralized place with documentation and guidelines for developers who are trying to improve their applications' observability.

Perhaps it could be study material for a future Monitoring and Observability Certification, by CNCF. 👀

But anyway, this would greatly help anyone who is writing proposals for open source projects or anyone trying to follow CNCF's guidelines for Cloud Native Observability.

Whitepaper: Add section about instrumentation

It would be epic to clarify what manual vs. auto instrumentation are. This also means explaining the Prometheus exporter pattern, mentioning semi-manual instrumentation (libraries auto-instrumenting functionality), etc.
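
For instance, a minimal Go sketch could contrast the two; the metric names here are made up, and promhttp.InstrumentHandlerCounter stands in for "a library auto-instrumenting functionality":

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Manual: the developer declares the metric and records it by hand.
var greetings = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_greetings_total",
	Help: "Greetings served.",
})

func greet(w http.ResponseWriter, r *http.Request) {
	greetings.Inc() // explicit instrumentation point
	w.Write([]byte("hello"))
}

func main() {
	// Semi-manual: a middleware provided by the library counts
	// requests (by status code and method) without per-call-site code.
	requests := promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "HTTP requests, by code and method.",
	}, []string{"code", "method"})
	http.Handle("/", promhttp.InstrumentHandlerCounter(requests, http.HandlerFunc(greet)))

	// Exporter pattern: the process exposes /metrics for Prometheus
	// to scrape over HTTP.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```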

Draft proposal for initial set of Working Groups

(from the TOC TAG document)

TAGs may choose to spawn focused and time-limited working groups to achieve some of their responsibilities (for example, to produce a specific educational white paper, or a portfolio gap analysis report). Working groups should have a clearly documented charter, timeline (typically a few quarters at most), and set of deliverables. Once the timeline has elapsed, or the deliverables have been delivered, the working group dissolves, or is explicitly re-chartered.

We need a clear and approachable process for community members (and those that are considering joining) to propose and potentially form a new Working Group.

To jumpstart and facilitate this:

  • Identify and enumerate potential WGs (the charter has a bunch to start with)
  • Propose a simple, clear process we'll use to evaluate and create new WGs
  • Create an artifact (.md in this repo) capturing this that can be shared with the community

Create presentation slide deck (1-3 slides)

This will be a 1-3 slide presentation representative of the initial thoughts and focus of this project. It will serve as a basis to solicit feedback from TAG-Chairs, TOC, and the TAG-Contributor-Strategy team, as well as all attendees of the presentation meeting.

WG: Foster and encourage in-person observability-focused meetups.

  • Brainstorm: What makes a fun meetup? What can we do to help?
  • Create materials (slides, tips, templates) to lower the barrier to forming these groups
  • Create a registry/list of meetups
  • Propose an actionable plan to jumpstart a few, and solicit TAG member volunteers

TAG Observability Logo

We need a logo! Folks at the TOC will complete a number of workflow steps once we have our logo created and in place.

  • Choose a logo
  • Update readme, landscape graphic, and other places with official logo
  • Write up "why owls"

WG: Observability Personas

Proposed personas to get us started:

  • App / service developers
  • Operators
  • Vendors
  • Projects
  • Cloud Providers
  • Projects & Vendors not yet in CNCF
  • TOC

For each, identify some number of specific examples and reach out to them. Listen to their challenges and wins. Our personas should be derived from primary sources, should be vetted, and should describe their needs, challenges, and expectations for their engagement with us.

Contribute to adding profiling as OTEL supported event type

As profiling has grown in popularity and more companies have started to build tooling around profiling, there has been a lot of discussion about adding profiling as an official OTEL supported event type.

Logs, metrics, and traces have already gone through a similar process to become more standardized, and seeing as we have identified profiling as an "increasingly important" observability signal in our whitepaper, it would be great for the Observability TAG to help provide input/resources towards this effort.

Several interested groups from a profiling-developer Slack channel, the OTEL issue linked above, and some who discussed this live at KubeCon EU this year have started to plan for making progress on this, and I've compiled some general thoughts about possible next steps in a doc here.

I'd love to discuss this as a group at the next meeting and also to gather thoughts, ideas, opinions, etc here as well!

cncf/landscape graph - kickoff!


Welcome!

This issue is to provide a way for members of the TAG Observability community to express interest in contributing to the https://github.com/cncf/landscape-graph project.

ACTION: Please leave a comment or the emoticon of your choice on this issue.

We'll target our project kick-off meeting 2/27 - 3/3 (doodle will be distributed, meeting(s) recorded).


The Landscape Graph project is looking for YOU! We are seeking folks who are interested in contributing to the project. We welcome anyone who wants to collaborate in the open. All of the following disciplines might find interesting challenges and the potential for novel work.

  • Project & Program Managers
  • Data Scientists, with a focus on Graph Data Science, AdTech, FinTech, and social networks.
  • UX / Design professionals
  • Financial Analysts
  • Developers (front end, back end, UI, k8s)
    • (any of, not all) Typescript, Python, Bash, Java, Cypher, SQL, GraphQL, ...
    • Kubernetes, Cloud Infrastructure, ...
  • Creative Content (technical writing, art, design)
  • Anyone with a passion for Open Source ecosystems, communities, and assessments of project health and composition.
  • Those wishing to work in a tech stack that's wildly useful for Observability Solutions.

Details

Why does this project exist?

Graphs can facilitate rich analysis of our vibrant and dynamic communities, the humans they comprise, and the clusters of contribution and thought leadership they produce.

Often, we need to understand how an open source project interacts with others, how it's changing over time, and who's enabling its continued success. We want to understand what alternatives exist, or how complementary projects might be combined in purpose-fit or novel ways. We might want to dive in and contribute! This is how projects and ecosystems grow to meet the business challenges facing modern organizations.

What's the overall approach at the outset of the project?

Using the data underlying the existing landscape as input, a Labeled Property Graph (LPG) is constructed using Cypher (SQL for graphs), resulting in a knowledge graph.

Landscape Graph Core Data Model (initial)


WG Content

Let's create one; we will see over time who is interested in doing what.

Some common roles (not a prescription of what needs to be done or how):

  • YT/Twitter/shepherd
  • Content creation
  • Content pipeline (new speakers, etc)

Observability knowledge entrypoint/index page.

As per today's CNCF SIG meeting, it would be awesome if we could provide an entry point to the observability projects we represent.

Overall, the goal would be to improve the experience for users, especially new joiners. See #18 for example questions new users have when entering the observability world, this time from the metrics angle.

Acceptance Criteria:

  • A friendly starting point for how to begin learning about observability for cloud-native apps
  • Instead of repeating content, link to projects' docs (some nice search as well?)
  • Establish a feedback channel with observability projects in case of missing docs/answers
