
trace-context's Introduction


Trace Context Specification

This repository is associated with the Trace Context specification, which specifies a distributed tracing context propagation format.

Trace Context v1 has W3C Recommendation status.

The next version of the specification is being developed in this repository.

The current draft of the Trace Context Specification from this repository's main branch is published to https://w3c.github.io/trace-context/. Explanations and justification for decisions made in this specification are written down in the Rationale document.

Team Communication

See communication

We appreciate feedback and contributions. Please make sure to read the Rationale document when you have a question about a particular decision made in the specification.

Goal

This specification defines formats to pass trace context information across systems. Our goal is to share this with the community so that various tracing and diagnostics products can operate together.

Reference Implementations

A few open source and commercial implementations of this trace context specification are available.

A simplistic regex-based implementation can be found in the test folder. This implementation is fully compliant with the test suite.
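For illustration, a minimal regex-based parse of a traceparent value might look like the following Python sketch. This is not the actual test-folder code, just a hedged example of the approach:

import re

# Illustrative only: a minimal regex in the spirit of the test-folder
# implementation, not the actual code from this repository.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(value):
    """Return (version, trace_id, parent_id, flags), or None if invalid."""
    match = TRACEPARENT_RE.match(value)
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent IDs are invalid per the spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return version, trace_id, parent_id, flags

print(parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"))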

.NET Framework will ship support for the trace context specification in an upcoming version. See Activity for implementation details.

A list of all current implementations can be found here.

Why are we doing this

See Why

Contributing

See Contributing.md for details.

trace-context's People

Contributors

adriancole, aleksi, aloisreitbauer, andrewslotin, anuraaga, arminru, basti1302, bhs, bizob2828, bogdandrutu, c24t, cwe1ss, danielkhan, deadtrickster, discostu105, dyladan, erabug, felixbarny, haf, iredelmeier, kalyanaj, licenser, mtwo, mwear, plehegar, reyang, sergeykanzhelev, wseltzer, wu-sheng, xfq


trace-context's Issues

Provide vendor/open source product name register for tracestate

Based on our workshop (see #57): every product needs a name in the header, and that name should be globally unique, to make sure the context is safe for each vendor or product.

I propose maintaining a registry of vendor and product names, to make sure there are no naming conflicts.

Describe strategies used to work with sites who don't have fixed-width IDs

Predefined ID formats allow propagation and reporting to interoperate, within the constraint that participants can fit their trace ID and span ID into 128 and 64 bits respectively. They also allow log correlation, and let third parties tokenize the IDs in understood ways. They do not help those who have different ID schemes (for example, breadcrumb IDs, path IDs, etc.).

Those who have other ID formats may be served by an exchange. For example, multi-tenant systems typically do not trust user-created IDs. They might accept an ID in the pre-specified format, link it to their internal IDs (which may or may not use the same format), carry it forward, and potentially emit it out of the system on the other side in the old format. Patterns like this could be enumerated and tested.
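As a rough sketch of that exchange pattern (the in-memory storage and the "breadcrumb" ID scheme below are hypothetical, purely for illustration):

import uuid

# Hypothetical ID exchange: accept a spec-format trace ID at the boundary,
# link it to an internal ID scheme, and restore the original on egress.
external_to_internal = {}  # in practice a persistent, tenant-scoped store

def on_ingress(external_trace_id):
    # Mint an ID in whatever scheme the system uses natively.
    internal_id = "breadcrumb-" + uuid.uuid4().hex
    external_to_internal[external_trace_id] = internal_id
    return internal_id

def on_egress(internal_id):
    # Carry the original spec-format ID forward out the other side.
    for external_id, mapped in external_to_internal.items():
        if mapped == internal_id:
            return external_id
    return None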

Another approach could be to embed a signal into the ID format indicating that it is not standard. Systems could make a best effort to join the trace, and define how they would fail when, for example, Unicode is sent instead of hex. In many cases tracing simply wouldn't work across these links, but we would be able to define what the expected behavior is.

This issue can serve as a brainstorming ground for how to ensure interop isn't compromised while aiming to support systems that aren't accommodated at the moment.

List all the projects that are willing to change and use this format

Here's a working list of tracing systems and constraints related to this specification

System       Trace ID   Span ID   Flags
OpenCensus   128-bit    64-bit    sampled
Jaeger       128-bit    64-bit    sampled, deferred, debug
Tracelytics  160-bit    64-bit    ?
Zipkin       128-bit    64-bit    sampled, debug

Jaeger notes:
  • we set the debug flag when user code does span.SetTag(SAMPLING_PRIORITY, v) with v > 0
  • it always also sets sampled: 1 (in our case the sampled bit is authoritative, not a suggestion; a suggestion would use deferred: 1)
  • the backend will not down-sample the data any further (which it can do for normal sampled traces to shed load)

Zipkin notes:
  • in B3, sampling is implied as authoritative downstream, and can be null
  • incompatible with B3's single-span-per-RPC approach, so needs work both server-side and in all tracers

Please add your project if it is not already on the list, so we can track requirements.

Remove "hexadecimal" restriction

This was discussed in the original PR (starting from #1 (comment)) but got lost/ignored. In its current form the spec excludes tracing systems that may be using differently formatted strings for trace and/or span IDs.

Catalog of well-known Correlation Context key names

To improve interoperability of tracers, we may start cataloguing well-known correlation context key names. This issue is to discuss whether such a catalog would be helpful, and to collect the initial set of names.

Define the charter more precisely

https://github.com/TraceContext/tracecontext-spec/blob/master/HTTP_HEADER_FORMAT.md#trace-context-http-header-format says

A trace context header is used to pass trace context information across systems for a HTTP request. Our goal is to share this with the community so that various tracing and diagnostics products can operate together, and so that services can pass context through them, even if they're not being traced (useful for load balancers, etc.)

  • what does "operate together" mean? If we have two applications A and B instrumented by two different tracing systems with different backends, what is the behavior we're expecting?
  • "services can pass context through them, even if they're not being traced (useful for load balancers, etc.)" - this goal does not require any specification of the format, just the standard header name

Do we need sampling to be more than one bit in trace-parent to define priority?

The first draft of the HTTP header spec had 140 comments in various areas, and not all came to closure. One question that remained was whether we do anything about vendor-specific sampling priority data, or relegate that to the key/value (aka baggage) functionality.

Here's background from the original pull request (#1)

One line of comments started from @bhs:
This discussion was about flags and why we had initially allocated 4 bytes; more room implied interpretation problems such as endianness. Bogdan had reserved the extra space because some proposed they needed room for sampling. This was reduced to a single byte, and a question was raised about whether we should extend it for sampling priority. This particular line of comments went quiet.

Another started afterwards from @SergeyKanzhelev:
Sergey suggested that since the sampling implementation would be vendor-specific, why not use a separate header? For example, not all vendors do consistent sampling per trace (some sample again at the span level). Yuri mentioned the benefits of a consistent trace-level algorithm. Sergey agreed with this, and also that a sampled bit as a force-trace operation is useful; however, he believed this should be a "debug" flag, not a sampling one. He suggested the sampling info should be vendor-specific.

After some chatter from Adrian, @SergeyKanzhelev maintained: "I still think it would make sense to detach sampling/debug flag from the identity information."

The last comment on sampling was from @bhs, who mentioned that one could stash the sampling rate (or reciprocal sampling rate) if arbitrary propagated key/value tags were supported somehow, and that the presence of a sampled bit could be ignored/overlooked when it doesn't match a model.

Allow mutating multiple keys of tracestate

The spec should allow mutating multiple tracestate entries in a single mutation. Here are scenarios this might enable:

1. Share base SDK

A single base SDK that has its own logic for updating a tracestate key may need an extensibility model for special types of applications that require an additional tracestate key.

Application Insights is designed for highly multi-tenant apps. The SDK today requires propagating an app-id between components, which allows a full trace to be built efficiently. By knowing the app-id of a caller, the backend knows which telemetry store to query for data.

There are also internal applications that use hierarchical identifiers to address spans. Hierarchical identifiers will not necessarily be implemented in the Application Insights SDK. Also, not every component is an internal app, and such components may not support the hierarchical identifier.

That said, internal applications will need two keys in tracestate: one set and maintained by the base SDK, containing the app-id (this enables the multi-tenancy scenario), and a second for the hierarchical identifier and the additional scenarios it brings.

One alternative is to host a smaller tracestate inside one of tracestate's keys. This smaller tracestate would also need a key/value store, and its values would point to a specific span-id to identify the parent in its own trace graph. However, this makes the field value worse to read and parse, without clear benefit.

2. Share trace properties

It is beneficial for many scenarios to share some common trace characteristics, like a sampling score. See #8 (comment).

Sharing a key in tracestate enables even more vendor-collaboration scenarios. The alternative is parsing each other's headers.

A drawback of documenting trace properties is the risk that tracestate will be used as baggage.
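To make scenario 1 concrete, here is a minimal Python sketch of a single mutation that updates two entries at once; the key names (ai, hier) are illustrative, not registered names:

# Illustrative sketch: an SDK updating two of its own tracestate entries
# (an app-id entry and a hierarchical-id entry) in one pass.
def set_entries(tracestate, updates):
    """tracestate is the raw header value; updates maps key -> new value."""
    entries = [e.strip() for e in tracestate.split(",") if e.strip()]
    kept = [e for e in entries if e.split("=", 1)[0] not in updates]
    fresh = ["%s=%s" % (k, v) for k, v in updates.items()]
    # Updated entries move to the front, as for any tracestate mutation.
    return ",".join(fresh + kept)

state = "congo=t61rcWkgMzE,rojo=00f067aa0ba902b7"
print(set_entries(state, {"ai": "appid-123", "hier": "root.1.2"}))
# -> ai=appid-123,hier=root.1.2,congo=t61rcWkgMzE,rojo=00f067aa0ba902b7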

Are traceparent fields optional or fixed width?

When trace-options is missing the default value for this bit is 0.

This part of the HTTP doc hints that the field could be left out. Can any of the fields be left out, or is this a mistake?

Compatibility and Interoperability Test Suite

Objective

Have a standard test suite that can take two or more "Nodes" and execute a series of transactions traversing each system, with the purpose of verifying that they exchange tracing headers in such a way that they are able to interoperate.

Benefits

By defining the tests and expectations described below, we essentially define functional requirements for what we expect to happen in different interop scenarios, something that is currently missing from the spec and which, in my opinion, causes many roundabout discussions of implementations without clear requirements.

Details

Node

Represents a microservice instrumented with a certain tracing library / implementation. Comes packaged as a Docker container that internally runs the tracing backend (or a proxy) and a small app that:
a. has a /transaction endpoint that executes the test case transaction
b. has an /introspection endpoint used by the test suite driver to verify that the respective tracing backend has captured the trace

Transactions

A transaction is described as a recursive set of instructions to call the next Node in the chain or to stop. E.g. it might look like

{
  callNext: {
    host: "zipkin", // name of the Node container running ZIpkin app, reachable via this host name
    callNext: {
        host: "jaeger",
        callNext: null
    }
  }
}

Running this transaction would execute a chain zipkin -> jaeger. When a Node receives such request, it looks for the nested callNext fragment and calls the next Node with that nested (smaller) payload. The last node will receive an empty request so it simply returns.
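A minimal Python sketch of the Node-side handling, assuming a JSON body of the shape above (the endpoint name follows this proposal; everything else is illustrative):

import json
import urllib.request

def handle_transaction(payload):
    # Look for the nested callNext fragment, as described above.
    next_call = payload.get("callNext")
    if not next_call:
        return {}  # last Node in the chain: nothing to call, just return
    # Forward the nested (smaller) payload to the next Node.
    body = json.dumps({"callNext": next_call.get("callNext")}).encode()
    request = urllib.request.Request(
        "http://%s/transaction" % next_call["host"],
        data=body, headers={"Content-Type": "application/json"})
    # A real Node would let its tracing library inject traceparent and
    # tracestate headers into this outgoing request.
    with urllib.request.urlopen(request) as response:
        return json.load(response)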

There can also be a convention that each Node's response contains the trace/span IDs it observed/generated, again as a recursive structure, e.g.

{
  traceId: "...",
  spanId: "...",
  next: {
    traceId: "...",
    spanId: "...",
    next: null
  }
}

This would allow the test driver to interrogate the introspection endpoint using those IDs.

Verifications

The test suite driver calls the /introspection endpoint of each Node to retrieve the captured trace(s) in some canonical form (just enough info for the test). If the /transaction responses contain trace/span IDs, it can do some validation against them.

Test Suite

The test suite is defined as a list of scenarios, e.g.

  • vendor1, vendor1, vendor1 (i.e. a single vendor site)
  • vendor1, vendor2, vendor1 (cross-site transaction)
  • etc.

Each scenario is instantiated multiple times (test cases) by labelling different vendors with roles from the scenario, e.g.

  • scenario 1, test case 1: vendor1 = zipkin
  • scenario 1, test case 2: vendor1 = jaeger
  • scenario 2, test case 1: vendor1 = zipkin, vendor2=jaeger
  • etc.

Each test case runs and validates a single transaction, and checks different modes of participation in the trace.

Parameterization

The test suite framework can also be used to test multiple implementations of the tracing library from a given vendor, e.g. in different languages. This can be implemented either as different Node containers (e.g. zipkin_java, zipkin_go) or as a single container controlled by environment variables.

Participation Modes

The nodes can also support different trace participation modes, at minimum:

  1. respect and reuse incoming trace ID
  2. record incoming trace ID as correlation field, but don't trust it, start a new trace

If the test driver knows ahead of time which participation mode a given Node supports (these can again be parameters to the Node), it can validate the expected behavior.

Prerequisites

Each vendor must be able to provide a Docker image (or several) to act as a Node in the test suite. Ideally the containers should be fully self-contained, i.e. not require external connectivity. It's possible to implement them as proxies to hosted tracing backends if necessary, but that will make the tests less reliable if those hosted backends are unavailable at times.

It's crazy / impossible

Jaeger internally uses an approach very similar to this one for many of its integration tests, in particular those that test compatibility of client libraries in different languages. Uber released a framework, https://github.com/crossdock/crossdock, that helps orchestrate these tests and permutations using docker-compose.

Support for a PII property of a Correlation Context value

Some vendors have sensitive context properties, like a user name, that can be passed to dependent services inside the organization but should not be propagated to external services. A PII flag on the property can indicate that the value is sensitive and should receive special treatment.

Find alternative meeting time

As discussed in the last meeting, we want to alternate meeting times. I used a meeting planner [1] to show which meeting times are good and bad for us. Unfortunately, there is no good time for all of us.

Overall this seems to be an NP-hard problem to solve. Here is what I did: I removed all meeting times that fall between midnight and 6:00 a.m. in any timezone. That gives you the following list.

UTC time                            San Francisco  Brooklyn      Linz         Kuala Lumpur
Tuesday, April 3, 2018 at 13:00:00  Tue 6:00 am    Tue 9:00 am   Tue 3:00 pm  Tue 9:00 pm
Tuesday, April 3, 2018 at 13:30:00  Tue 6:30 am    Tue 9:30 am   Tue 3:30 pm  Tue 9:30 pm
Tuesday, April 3, 2018 at 14:00:00  Tue 7:00 am    Tue 10:00 am  Tue 4:00 pm  Tue 10:00 pm
Tuesday, April 3, 2018 at 14:30:00  Tue 7:30 am    Tue 10:30 am  Tue 4:30 pm  Tue 10:30 pm
Tuesday, April 3, 2018 at 15:00:00  Tue 8:00 am    Tue 11:00 am  Tue 5:00 pm  Tue 11:00 pm
Tuesday, April 3, 2018 at 15:30:00  Tue 8:30 am    Tue 11:30 am  Tue 5:30 pm  Tue 11:30 pm

[1] https://www.timeanddate.com/worldclock/meetingtime.html?month=4&day=3&year=2018&p1=224&p2=4826&p3=319&p4=122&iv=1800

Special character between trace context fields

One of the requests was to add an extra human-readable separator character between the trace context fields. The current proposal was not to have this extra character.

A proposal to use the '-' character between version/trace_id/span_id/options:
Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01000000

This proposal would allow SREs to easily copy-paste or parse the trace-id/span-id/options if needed.

Position zero in tracestate identifies the traceparent

I believe the left-most position of the tracestate should identify the calling trace (the one that wrote traceparent). There were arguments against doing this at one workshop, but it makes a lot of sense if we just make this a requirement. That's why I've done this in #72.

In doing so, we remove the ability to split tracestate across multiple header values (as they might not be put back in order). I think this is OK. For example, we don't define behavior for multiple traceparent entries, so special-casing multiple tracestate entries is a strange optimization, especially with a length constraint of 512 bytes.
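Under that requirement, the propagator's update step might look like this Python sketch (the vendor keys congo and rojo are just examples):

# Illustrative: the system that writes traceparent moves (or adds) its own
# tracestate entry to the left-most position.
def update_tracestate(tracestate, vendor_key, vendor_value):
    entries = [e.strip() for e in tracestate.split(",") if e.strip()]
    rest = [e for e in entries if not e.startswith(vendor_key + "=")]
    return ",".join(["%s=%s" % (vendor_key, vendor_value)] + rest)

print(update_tracestate("rojo=00f067aa0ba902b7,congo=t61rcWkgMzE",
                        "congo", "ucfJifl5GOE"))
# -> congo=ucfJifl5GOE,rojo=00f067aa0ba902b7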

Setup gitter

Folks are using various Gitter channels at the moment; let's create one to discuss this spec.

List all the libraries that are willing to use this format

#4 seems to be reserved for backend support, so I am creating this issue to track the library work.

The Go programming language wants to introduce a trace package and implement this spec. We were considering establishing our own data format to be able to instrument the standard library internally, converting tracing systems' formats to ours and back to be able to continue traces started outside of Go code. A standard will help us a lot to be compatible with the rest of the world without doing the ugly and expensive conversion work.

Some of our possible future items:

  • net/http.Handler will be able to automatically extract incoming traces, net/http.Transport will inject traces into the outgoing requests.
  • database/sql will create spans for ExecContext.
  • Go libraries will be able to propagate traces via context.Context.
  • Go libraries will be able to instrument code without external dependencies.

What is the milestone of this spec?

I am aware there are definitely plenty of details left to discuss for this spec. But I suggest we can do some work at the implementation level. To do that, we should set a milestone, like 0.1 or 1.0, as you like. Then tracers/APM agents can try to implement the spec, and find out what is next for everyone.

@adriancole @bogdandrutu @SergeyKanzhelev

Version in header

I don't think you need a version in the header, because:

  1. If you eventually need to revise the format, you can introduce a new header (either as a replacement, or as additional information)
  2. It implies some means of version negotiation (which you don't have AFAICT)
  3. Generally, you only get one go at this anyway :)

Support for a type property of a Correlation Context value

Some tracing vendors support types for context properties. We could add a type property to allow an explicit type specification. Something like this:

Boolean

A binary flag. Supported values are 1 for true and 0 for false.

Examples:

IsAuthenticated=1;type=bool
IsAuthenticated=0;type=bool

Number

A numeric value, as described for IEEE 754-2008 binary64 (double-precision) numbers (IEEE754).

Examples:

ExposurePercentage=33.33;type=number
Step=10;type=number
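A Python sketch of how a consumer might interpret such typed values; the ;type= syntax is the proposal above, not anything the spec defines today:

# Illustrative parser for the proposed "value;type=..." annotation.
def parse_typed_value(raw):
    if ";type=" not in raw:
        return raw  # untyped values stay strings
    value, type_name = raw.split(";type=", 1)
    if type_name == "bool":
        return value == "1"  # 1 is true, 0 is false, as proposed
    if type_name == "number":
        return float(value)  # IEEE 754 binary64, as proposed
    return value             # unknown type: fall back to string

print(parse_typed_value("1;type=bool"))        # True
print(parse_typed_value("33.33;type=number"))  # 33.33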

IETF/IANA registration of header fields

As soon as possible, you want to make sure that you have the right text in the spec to register the defined header fields with the IETF and get them into the IANA header field registry. If you need help with that, let me know.

Clarify the HTTP header name choice for trace

Hi,

The following text made me raise an eyebrow, so I thought I'd give some feedback; maybe the text could clarify these points for future readers.

While HTTP headers are conventionally delimited by hyphens, the trace context header names are not. Rather, they are lowercase concatenated "traceparent" and "tracestate" respectively. The departure from convention is due to practical concerns of propagation. Trace context is unlike typical http headers, which are point-to-point and do not propagate through other systems like messaging. Different systems have different constraints. For example, some cannot read case insensitively, and others forbid the hyphen character. Even if we could suggest not using the same format for such systems, we know many systems transparently copy http headers into fields. This class of concerns only exist when we choose to support mixed case with hyphens. By choosing not to, we open trace context integration beyond http at the cost of a conventional distraction.

  1. HTTP headers are case-insensitive, so I'm unsure why the lowercase mention exists. A specification cannot redefine HTTP, and nothing can or would prevent proxies and intermediaries from re-casing as they see fit.
  2. Most HTTP headers are not point-to-point; there is only a limited list, alongside the ones named in the Connection response header.
  3. The reference to "fields" is confusing here; I'm not sure what it refers to.
  4. If many systems transparently copy HTTP headers into fields, then they'll end up copying the vast majority of headers, which follow the convention, so I'm unsure what scenario we cater for here, especially as most pre-existing and deployed tracing headers use the dash without triggering interoperability problems.

Many thanks.

Switching work into develop branch

I'll make the develop branch the default branch on Monday. This was discussed at the last meeting; the idea is to separate the published draft from the active one. Please be aware.

Describe headers using ABNF format

from @AloisReitbauer

Honestly, I find it weird to describe the header format in plain English. We should use ABNF syntax to describe it. As an example, look at the grammar for the HTTP Cookie header:

...
cookie-date     = *delimiter date-token-list *delimiter
date-token-list = date-token *( 1*delimiter date-token )
date-token      = 1*non-delimiter

delimiter       = %x09 / %x20-2F / %x3B-40 / %x5B-60 / %x7B-7E
non-delimiter   = %x00-08 / %x0A-1F / DIGIT / ":" / ALPHA / %x7F-FF
non-digit       = %x00-2F / %x3A-FF

day-of-month    = 1*2DIGIT ( non-digit *OCTET )
month           = ( "jan" / "feb" / "mar" / "apr" /
                    "may" / "jun" / "jul" / "aug" /
                    "sep" / "oct" / "nov" / "dec" ) *OCTET
year            = 2*4DIGIT ( non-digit *OCTET )
time            = hms-time ( non-digit *OCTET )
hms-time        = time-field ":" time-field ":" time-field
time-field      = 1*2DIGIT

Should standard define a way to tell which tracing vendor is being used?

For instance, the HTTP header User-Agent is used to provide (optional) information about the caller.

In a world of distributed tracing, it may be useful to give a callee a hint about the vendor-specific format of the trace, so the callee can properly process data in the Trace-Context-Ext header. In that case, should the standard define another optional header that contains the name of the trace format or vendor? Or should it suggest using the standard User-Agent header for that?

Add language about abuse and risks for context and sampling flag entering/leaving organizational boundaries via public APIs, outbound calls

As discussed just now at the Distributed Tracing Workshop, it would be good to have some language in the spec around Trace-Context, Trace-Context-Ext, and Correlation-Context headers that come from callers outside your organization, or that are unintentionally propagated out to HTTP servers outside your organization's administrative control.

For example, if you have enabled tracing on a service with a public API and naïvely continue any trace with the sampling flag set, using the trace ID provided, a malicious attacker (or simply any third party setting the HTTP headers) could overwhelm your application with tracing overhead, forge trace ID collisions that make your monitoring data unusable, or run up your bill with your SaaS tracing vendor.

Similarly, if you automatically propagate sensitive baggage values in your Correlation-Context on all HTTP requests your services make, you may be revealing sensitive information to HTTP APIs outside your organization's boundary. Or your vendor/implementation may reveal something about your organization in the Trace-Context-Ext header.

AWS X-Ray's docs have a little blurb about this:

Tracing Header Security
A tracing header can originate from the X-Ray SDK, an AWS service, or the client request. Your application can remove X-Amzn-Trace-Id from incoming requests to avoid issues caused by users adding trace IDs or sampling decisions to their requests.
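As a sketch of the kind of edge policy such spec language could describe, an edge service might refuse to trust inbound context and start a fresh trace instead. The policy below is illustrative, not something the spec mandates (it uses the current traceparent/tracestate names rather than the older ones above):

import secrets

UNTRUSTED_HEADERS = ("traceparent", "tracestate")

def sanitize_inbound(headers, caller_is_external):
    # Inside the organization, pass headers through untouched.
    if not caller_is_external:
        return headers
    # From outside: drop inbound trace context (optionally keeping the
    # original value as a correlation breadcrumb only) and start fresh.
    cleaned = {k: v for k, v in headers.items()
               if k.lower() not in UNTRUSTED_HEADERS}
    trace_id = secrets.token_hex(16)  # new 128-bit trace ID
    parent_id = secrets.token_hex(8)  # new 64-bit parent ID
    cleaned["traceparent"] = "00-%s-%s-00" % (trace_id, parent_id)
    return cleaned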

Define behavior for response header context propagation

There are scenarios where services need to return correlation information in the response. These include:

  • Sampling flag (+ sampling score) for delegated sampling
  • Tenant ID/identity of the service so caller knows where to query telemetry from

The spec needs to define which headers an SDK may expect and a service should use for the HTTP response. Behavior for service meshes and proxies also needs to be defined so that these headers are not lost.
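Purely as a strawman (the spec defines no response headers, and the header names below are made up for illustration), delegated sampling plus a tenant hint might look like:

# Hypothetical sketch only: "traceresponse" and "x-tenant-id" are NOT
# defined headers, just placeholders for the scenarios listed above.
def build_response_headers(trace_id, sampled, tenant_id):
    flags = "01" if sampled else "00"
    return {
        "traceresponse": "00-%s-%s" % (trace_id, flags),  # sampling decision
        "x-tenant-id": tenant_id,  # tells the caller where to query telemetry
    }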

Team Teleconference

As we plan to primarily communicate via GitHub, I created this issue to schedule teleconferences. Please subscribe.

Little endian vs Big endian

Some people raised the issue of the endianness we use for the trace-options. We need to agree on one endianness and use it to encode the trace-options and any future fields.

The current proposal is to use little-endian because it is more natural for most CPU architectures (probably >90% of CPUs in use are little-endian).

The suggested alternative is to use big-endian because it is more human-readable, and is also the default in Java and the network byte order.
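To make the difference concrete, the same four option bytes read differently under the two conventions (Python sketch):

# The same 4-byte trace-options field interpreted both ways.
options = bytes.fromhex("01000000")
print(int.from_bytes(options, "little"))  # 1, sampled bit in the low byte
print(int.from_bytes(options, "big"))     # 16777216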

Please vote here for one or the other.

About vendor identification

This is our sample context:

Value = 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
base16(<Version>) = 00
base16(<TraceId>) = 4bf92f3577b34da6a3ce929d0e0e4736
base16(<SpanId>) = 00f067aa0ba902b7
base16(<TraceOptions>) = 01  // sampled

And as an implementation developer (I am the SkyWalking APM author), how can I know whether this is my own context or a cross-tracer-implementation context? Right now, because SkyWalking carries more context than these fields, we are going to set two header names; if we find the other one, we know it's our context. But I think this is not an elegant way.

How about adding a field which represents the vendor, and maintaining a list of these codes? With this, users would know how to use the outer trace-id, and where to query the real trace, if they have two or more tracing systems.

Charter questions from W3C review

@dontcallmedom asked a few questions while reviewing the draft charter, and gave me permission to post them here. Thanks!

  • the mission statement, while concise, is probably going to be hard to
    understand for many - "tracing tool" is pretty vague if you're not in
    that business. The first sentence of the scope section might benefit
    from being brought up here

  • "Testing plans for each specification, starting from the earliest
    drafts" is not a complete sentence

  • probably related to the above on implementations, do we know what a
    test suite would look like for this work?

  • is there any reason to think that facilitating the interop of traces
    would impact negatively privacy? i.e. is there "privacy through
    obscurity" that this would weaken?

  • the link on "Trace Interchange format" should be removed (since it
    goes nowhere)

  • from what I can see, there has been no incubation on the interchange
    format yet; should that be a concern per the Rec-track readiness criteria?

  • given we are already in April 2018, the timeline probably needs to be
    shifted for the first few items

  • I have a hard time parsing "The trace context standard is also
    applicable to be implemented for work done within the web performance
    working group." - maybe simplify and illustrate?

Missing elements in the spec

Hello, I was just comparing this to what we do with our tracing in AppDynamics, and there are a few things that I believe a standard should have. I'll go from most concrete to most abstract. I am happy to write these up in whatever form is deemed most useful, and in as much detail as needed.

  • Support for other protocols. Offhand I see the need to support technologies that are not based on HTTP, such as JMS and AMQP. Similarly, we may want a way to embed data into backend calls such as JDBC or ADO.NET. LDAP is another protocol we should add support for. I'm happy to provide details in supporting these areas.

  • Adding metrics or other data within the header and standard. For example, we embed additional business context or other data in the headers to enrich our agents' decision-making around data capture. These could be handled as baggage, but it would be good for some of these metrics and business data to be usable.

  • Having the ability to pass a metric between two nodes, or to have metrics be only short-lived within the headers.

  • Headers to support thread correlation and async. We have a specific pattern for doing this, and it's typically a problem for any of the OSS tracing products I have seen. There are ways to do thread correlation which may stem from a single transaction upstream; hence this is critical to the way the header is designed and used.

Hope these ideas make sense, happy to discuss and contribute!

Refactor spec towards trace-parent and trace-state

During the last workshop, we discussed the role of the two headers somewhat differently than before.

trace-parent and trace-state are the headers defined in the trace context specification; these replace trace-context and trace-context-ext respectively. The major point here is that we understand there can be multiple trace graphs related to an incoming request, for example one from an Amazon API gateway and one from Zipkin. In brief, the new names clarify that the portable format describes the incoming request, and that the second header is not an extension of it.

MVP to @erabug for giving us the push from the cryptic name trace-context-ext -> trace-state.


  • trace-parent (formerly known as trace-context) describes the position of the incoming request in its trace graph, in a portable format

  • trace-state includes a mapping of all graphs the incoming parent is a part of, in potentially vendor-specific formats. For example, if an ancestor of this request was in a different trace, there would be a separate entry for the last position it held in that trace graph

The trace-state value is not reliant on the data in trace-parent, except when the parent has no data beyond the trace-context format. For example, the headers below hint that there is only one trace graph, and it's Amazon's:

trace-parent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
trace-state: aws=BleGNlZWRzIHRoZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZS4=

The trace-state value is not an extension of the data in trace-parent; rather, it is the gold copy per vendor.

Please don't get caught up on the label "aws" or the base64 encoding. The above example isn't realistic in either case, as we don't know the label or encoding AWS will use. Suffice to say, this demonstrates that the values are different!

When the incoming trace is in the generic format, the value of a trace-state entry can be left out. This can help with header compression while still allowing branding, tenant, or otherwise custom labels to be used. These labels can help differentiate systems that use the same format:

trace-parent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
trace-state: yelp

The above is shorthand for

trace-parent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
trace-state: yelp=00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

but this is better for HTTP compression and performance in general.

While I mentioned some details, in order to keep this issue focused on the rename, it intentionally does not cover other parts of the state model. Those can be opened later, for example to elaborate on what happens hop-to-hop. This issue should be closed out with a rename and a corrected description of the two headers.

Async Downstream Correlation

One of the limitations of all non-commercial standards is the handling of asynchronous transaction measurement and tracking. This is an advanced technique which many APM tools do not do well today. To do this well, you have to understand the current problem:

The user makes a request to SystemA. SystemA runs several threads which spawn several new transactions. These transactions need to be tracked and attributed back to the initial user request. This poses a complex problem for instrumentation, downstream correlation, and reporting/visualization.

I believe we need to add headers that allow new transactions to be attributed to a master transaction. These could track threads within the process and threads out of the process. Often, having two headers makes more sense, to be able to attribute them back to a specific runtime, or even a second runtime running on the same (or another) system.

Define purpose of correlation context

Currently the definition and examples of what correlation context is about are misleading. We should define that this is for tracing tool usage only, and adjust the examples to reflect vendor-specific extensions like hierarchical correlation.

Happy to take the first run at this one.

Backend Detection

I do believe we need some kind of construct to identify a "backend": a way to pass back to the monitoring system that the transaction has reached its deepest part, and what that part was. For example, in Java, if the system calls a MongoDB backend, we should show this and pull other stats on that backend. We could go further and collect other data about the call to it. Example backends could be a message queue (without instrumentation on the other end), a database, a data platform, a streaming platform (Kafka and others), HTTP (a call to an external API), another transaction processing system (Tuxedo, mainframe, etc.), RMI, or protobuf. The instrumentation library would have to auto-recognize these, or allow the developer to define or otherwise mark them. This would be communicated back to the monitoring system or tool.

Correlation Context Should be Split

Having a single header for all of this data will be detrimental for several reasons. The first is that metrics are numbers, and we can do maths on them; we'll likely have systems which want these metrics for context, and they can be sniffed or captured at various levels. Similarly, there are many use cases for metadata: it could be transaction context for other tools (security, forensics, topological systems), or instructions for orchestration. For these reasons the two should be split:

Metrics:  measurename=number, e.g. processingtime=34.534
Metadata: metadataname=metadata, e.g. browserlocation=Boston,MA
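A small Python sketch of why the split helps: with a dedicated Metrics header, every value is a number by definition and can be aggregated directly (the header syntax here is the proposal above, not anything specified):

# Illustrative: parse a proposed Metrics header into numbers we can do
# maths on, without guessing types per entry.
def parse_metrics(header_value):
    metrics = {}
    for entry in header_value.split(","):
        name, _, value = entry.strip().partition("=")
        metrics[name] = float(value)
    return metrics

metrics = parse_metrics("processingtime=34.534,queuedepth=7")
print(sum(metrics.values()))  # 41.534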
