spandex-project / spandex
A platform agnostic tracing library
License: MIT License
Due to https://github.com/zachdaniel/spandex/blob/master/lib/datadog/span.ex#L116, if you set meta on any span with Tracer.update_span, all children will also receive the meta. Is this intended?
I think it's kind of confusing to have negative configuration options like disabled?: true rather than enabled: false. I also think it looks weird to have configuration options with question marks at the end, but I might be the one who's wrong on that one. It bothers me less when it's a function name.
I think we could make this change without breaking anything, so it doesn't need to wait for the 3.0.0 milestone.
In order to become more versatile, these things should be written as separate tools/libs. Having them in spandex causes people to need dependencies they would otherwise not need.
Our documentation is currently incorrect, as it claims that a traced/1 decorator exists that would allow for adding context to the parent span.
I have a use case where I'd like to include only bottom-level spans (leaf spans) in my trace, so that starting from some distinguished span, only the spans that do not have children are forwarded. Is it possible to achieve that without forking this library and implementing such a feature?
We should follow modern patterns for configuration, and use a pattern whereby configuration is scoped to a module, such that there could be multiple, differently configured tracers.
Currently metadata is only allowed in update_span arguments. For simplicity, it would be nice if we could pass metadata arguments to Tracer.span along with its block, like:

```elixir
Tracer.span("span_me_also", resource: "aaa", sql_query: [query: "..."]) do
  ...
end
```
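Until something like that exists, a workaround under the current API, where metadata is only accepted by update_span, is to call update_span as the first expression in the block (a sketch using the option names from the example above):

```elixir
Tracer.span("span_me_also") do
  Tracer.update_span(resource: "aaa", sql_query: [query: "..."])
  ...
end
```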
While running blockscout's tests I get a lot of warnings:

```
warning: the Collectable protocol is deprecated for non-empty lists. The behaviour of things like Enum.into/2 or "for" comprehensions with an :into option is incorrect when collecting into non-empty lists. If you're collecting into a non-empty keyword list, consider using Keyword.merge/2 instead. If you're collecting into a non-empty list, consider concatenating the two lists with the ++ operator.
  (elixir) lib/collectable.ex:83: Collectable.List.into/1
  (msgpax) lib/msgpax/packer.ex:151: Msgpax.Packer.List.pack/1
  (msgpax) lib/msgpax/packer.ex:152: anonymous fn/3 in Msgpax.Packer.List.pack/1
  (elixir) lib/enum.ex:1940: Enum."-reduce/3-lists^foldl/2-0-"/3
  (msgpax) lib/msgpax/packer.ex:151: Msgpax.Packer.List.pack/1
  (msgpax) lib/msgpax.ex:85: Msgpax.pack/2
  (msgpax) lib/msgpax.ex:122: Msgpax.pack!/2
  (spandex_datadog) lib/spandex_datadog/api_server.ex:192: SpandexDatadog.ApiServer.send_and_log/2
  (spandex_datadog) lib/spandex_datadog/api_server.ex:169: anonymous fn/3 in SpandexDatadog.ApiServer.handle_call/3
  (elixir) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
  (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
```
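The deprecation warning above is about collecting into a non-empty list. A small standalone illustration of the Keyword.merge/2 replacement it suggests (unrelated to the msgpax internals, just showing the pattern):

```elixir
defaults = [host: "localhost", port: 8126]
overrides = [port: 9000]

# Deprecated on non-empty lists:
#   Enum.into(overrides, defaults)

# Suggested replacement; keys in the second list win:
Keyword.merge(defaults, overrides)
# => [host: "localhost", port: 9000]
```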
Travis CI seems impossible to deal with. I'm having all sorts of issues provisioning it, or even getting it to show up under this organization. There is also like a legacy travis CI account that I have, that might run CI on this org? I can't seem to install it in this organization using the marketplace.
Spandex should expose structs for representing traces/spans, and then each adapter uses this structure. We would probably have a private field for adapter-specific fields.
Currently traces are published synchronously by the process doing the tracing, which is definitely not scalable.
Is there a possibility to add Datadog's tags to traces? So that in addition to the metadata on the span, we tag the span, e.g. with result:success.
Based on a comment from the DataDog slack:

> right, we have a special API to report the list of services with their metadata (this is an internal API which might disappear at some point), it would still work if you don't implement it
> that is where we assign to services an "app" and an "app_type"; only used in the UI to represent a service (the icon in the service list / the text when hovering)

So I would suggest removing the part where we create services for DD and sticking to sending spans only.
In the docs it just says "default service name" which I don't understand.
What does this setting actually do?
spandex-project/spandex_datadog#3 illustrates how we're just using a raw map to communicate data from the adapter to spandex. Instead, we should have distributed trace contexts hydrate a struct provided in core.
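A minimal sketch of what such a core struct might look like (the field names here are assumptions for illustration, not necessarily the actual Spandex.SpanContext definition):

```elixir
defmodule SpanContext do
  # Hypothetical core struct that adapters would hydrate
  # instead of passing raw maps between the adapter and spandex.
  defstruct trace_id: nil, parent_id: nil, priority: 1, baggage: []
end

# An adapter would decode incoming headers and return this struct:
ctx = %SpanContext{trace_id: 123, parent_id: 456}
```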
Hey!
The documented way of tracing Ecto was lost somewhere between v1.3 and v1.6. However, googling for "Spandex Ecto" still takes you to that outdated page. Would it be possible to describe, or link to, the proper way of tracing Ecto right in the main README?
Thanks for all your great work!
Right now, the library does everything it can not to fail any operations whatsoever, at all. This can mean that updates to spans might fail if they are not valid, or that entire traces aren't started. This also happens silently, currently. We want some kind of strict mode that fails on any issue with any operation, or perhaps a log_errors configuration.
We don't have a valid use case for each tracer having its own storage key, and in fact this makes using the library much harder than it should be.
See the Plug docs for details, but libraries aren't supposed to pollute the assigns on the Conn.
https://github.com/elixir-plug/plug/blob/master/lib/plug/conn.ex#L69
https://github.com/elixir-plug/plug/blob/master/lib/plug/conn.ex#L84-L89
We should separate the datadog adapter into a separate repository.
We currently use the process dictionary, which is incompatible with certain designs. We should make both available as options.
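A sketch of how a pluggable storage strategy could look, with the process dictionary as one implementation. The module and callback names here are illustrative, not the actual Spandex strategy API:

```elixir
defmodule TraceStorage do
  # Illustrative behaviour for pluggable trace storage; a tracer would
  # be configured with one implementation of this behaviour.
  @callback get_trace(key :: atom) :: term
  @callback put_trace(key :: atom, trace :: term) :: :ok
end

defmodule TraceStorage.Pdict do
  @behaviour TraceStorage

  # Process-dictionary-backed storage, i.e. the current default approach.
  # An ETS- or GenServer-backed module could implement the same callbacks.
  @impl true
  def get_trace(key), do: Process.get({__MODULE__, key})

  @impl true
  def put_trace(key, trace) do
    Process.put({__MODULE__, key}, trace)
    :ok
  end
end
```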
The first questions I had when I opened this project were "What does it do? What is it good for?" ;)
Hi!
First off, thanks for making this. I really appreciate it and I enjoyed your talk at the BEAM conference. We use an umbrella app and most of spandex is working; however, I noticed this line:
https://github.com/spandex-project/spandex/blob/master/lib/decorators.ex#L27
I am trying to figure out which OTP app would be chosen here in Application.get_env during compile time. I've moved the Datadog.ApiServer and my Tracer into a common application that is shared across my apps in the umbrella app. Currently, I have the decorator set in every config.exs for each app, but I was wondering if this is overkill, or if you had any idea which OTP app is the one used at compilation time for the decorators.
I have the correct config for my logger:

```elixir
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id, :viewer_id, :trace_id, :span_id]
```

But the trace_id is not being logged:

```
Oct 23 15:51:51 dc1-live-appserver2 eggl[30255]: 15:51:51.298 request_id=iahoj1li85bhul2kf16m54lqsd11000p viewer_id=788210 span_id=4174988920911713992 [info] QUERY OK source="valuations" db=1.5ms queue=0.1ms
```
Hey, I've spent the day trying to make this work. I looked at the example (which is outdated, but I found a couple of places with newer documentation and adapted). I don't see any logs or exceptions, even when I enable verbose? mode. Is there any place I can go to receive assistance? I've even tried to debug the code, to no avail :(
current_context should return a consistent value in all cases, one that can be passed to continue_trace. With that value, continue_trace should be able to determine whether it should start a trace at all and, if so, whether that trace should be a continuation of the previous context or a new trace.
Hello,
Sorry for the generalness of this question. I'm having some trouble understanding how traces work in the context of many child processes. I understand that I can use Spandex.get_current_span and Spandex.continue_trace to copy the relevant data into the child process's dictionary, but I'm not sure if the continued trace should have a new name. My high-level problem is that we're running a large GraphQL API (Absinthe) where some of the field resolvers run asynchronously via wrapper functions and some do not, and we'd like to use Spandex to make a flame graph where each span represents one resolver. Do you have any suggestions or thoughts about this? Does it make sense to add functions that just copy the process dictionary into a child process without requiring a name?
Currently there are two methods that Spandex provides to do distributed tracing.

inject_context:

```elixir
@spec inject_context(headers(), SpanContext.t(), Tracer.opts()) :: headers()
```

distributed_context:

```elixir
@spec distributed_context(Plug.Conn.t(), Tracer.opts()) ::
        {:ok, SpanContext.t()}
        | {:error, :disabled}
```

Given that these two methods are intended to be counterparts of each other, would it make more sense if the distributed_context interface was actually:

```elixir
@spec distributed_context(headers(), Tracer.opts()) ::
        {:ok, SpanContext.t()}
        | {:error, :disabled}
```

An example use case would be injecting and extracting tracing contexts from a gRPC request, which does not rely on Plug.Conn. Thoughts?
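For concreteness, a rough sketch of what a headers-based extraction could look like, assuming Datadog-style header names and a plain map in place of SpanContext.t(). This is a hypothetical illustration of the proposal, not the actual Spandex implementation:

```elixir
defmodule HeaderContext do
  # Hypothetical headers-based distributed_context/2: takes a list of
  # {name, value} header tuples instead of a Plug.Conn.
  def distributed_context(headers, _opts \\ []) do
    headers = Map.new(headers, fn {k, v} -> {String.downcase(k), v} end)

    with {:ok, trace_id} <- fetch_int(headers, "x-datadog-trace-id"),
         {:ok, parent_id} <- fetch_int(headers, "x-datadog-parent-id") do
      {:ok, %{trace_id: trace_id, parent_id: parent_id}}
    else
      _ -> {:error, :no_distributed_trace}
    end
  end

  defp fetch_int(headers, key) do
    with {:ok, value} <- Map.fetch(headers, key),
         {int, ""} <- Integer.parse(value) do
      {:ok, int}
    else
      _ -> :error
    end
  end
end
```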
Hi,
I'm working with this library but I can't get information sent to Datadog. Please tell me if there is any other diagnostic information I can provide or troubleshooting steps I've missed. I've set up the application as the README suggests, and then, when my application is running on the same host as the Datadog agent, I run the following commands in iex:

```elixir
Tracer.start_trace("foo")
Tracer.update_span(service: :my_service, type: :web, resource: "/bar")
Tracer.finish_span()
Tracer.finish_trace()
```
What else I've tried
Source Files:
Thank you
What is the proper way to pass a trace between Elixir processes? I don't see public methods in the Tracer interface for obtaining the current trace, and the Spandex module uses strategy.get_trace for start_span/update_span. Either I'm missing it, or maybe you have plans to implement it in the future.
Right now, a lot of errors are ignored via usage of update_or_keep/2, which either updates the span and returns the new span, or doesn't update the span at all but returns the old span. This made some sense as a way to have consistent return types, but we're moving towards having more idiomatic return types, and that includes not swallowing errors. More discussion can be found in #63, where the issue was originally discovered. Additionally, that PR adds tests that can be un-skipped when the update functions have these new predictable return types.
I am a bit at a loss where to start debugging this, so this is a bit of a broad ask for help. It seems that when we put some load on the Spandex GenServer, it crashes from time to time, causing spans to fail hundreds of thousands of times with messages like:

```
GenServer.call(#PID<0.934.1>, {:update, #Function<18.39421655/1 in SpandexDatadog.ApiServer.handle_call/3>}, 5000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

GenServer terminating:
exited in: GenServer.call(SpandexDatadog.ApiServer, {:send_trace, ... 30000)
** (EXIT) exited in: Task.await(%Task{owner: #PID<0.2917.0>, pid: #PID<0.9250.0>, ref: #Reference<0.698663207.2437152781.202107>}, 5000)
** (EXIT) time out
```

Now I cannot pinpoint a specific reason the Spandex GenServer itself crashes, but perhaps the timeout could simply be hit when there is a queue or something? Any help or directions on how to debug this would be greatly appreciated.
I ran a profiler on my app using eprof and fprof to identify bottlenecks, tracing the execution of all functions in the code and reporting the time consumed by each. Looking at the results, 25% of the time was spent in Optimal, which is the lib responsible for validating options. So I set up a benchmark script using Benchee to compare the original code, with the Optimal dependency, against a version that overrides it with a simpler implementation. The result was about 10% faster over the whole request without the Optimal dependency.

https://hexdocs.pm/mix/1.8.1/Mix.Tasks.Profile.Eprof.html
https://hexdocs.pm/mix/1.8.1/Mix.Tasks.Profile.Fprof.html
```
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
Number of Available Cores: 3
Available memory: 5.82 GB
Elixir 1.8.1
Erlang 21.3.7

Benchmark suite executing with the following configuration:
warmup: 0 ns
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 15 s

Benchmarking 0_warmup...
Benchmarking 1_normal...
Benchmarking 2_optimal_overridden...
warning: redefining module Optimal (current version loaded from _build/test/lib/optimal/ebin/Elixir.Optimal.beam)
  priv/myapp_web/script_optimal.exs:23

Name                           ips        average  deviation         median         99th %
2_optimal_overridden        270.69        3.69 ms    ±33.85%        3.26 ms        7.91 ms
1_normal                    248.06        4.03 ms    ±35.16%        3.58 ms        9.72 ms
0_warmup                    129.54        7.72 ms  ±1258.76%        3.45 ms       11.21 ms

Comparison:
2_optimal_overridden        270.69
1_normal                    248.06 - 1.09x slower +0.34 ms
0_warmup                    129.54 - 2.09x slower +4.03 ms
```
Spandex needs to be able to toggle logging. The logger can be a big bottleneck in high traffic applications, and logging in erroneous circumstances should be opt in.
Hey! Phoenix/Plug works fine for me, but I cannot do custom tracing:

```elixir
defmodule Core.Tracer do
  @moduledoc """
  DataDog tracer
  """
  use Spandex.Tracer, otp_app: :dice
end

defmodule ManuallyTraced do
  alias Core.Tracer
  require Core.Tracer

  # Does not handle exceptions for you.
  def trace_me() do
    _ = Tracer.start_trace("my_trace") # also opens a span
    _ = Tracer.update_span(service: :ecto, type: :db, sql_query: [query: "SELECT * FROM posts", rows: "10"])
    :timer.sleep(1000)
    _ = Tracer.finish_trace()
  end
end

ManuallyTraced.trace_me()
```

The Datadog Agent rejects this trace:

```
[ TRACE ] 2018-08-01 19:35:16 ERROR (receiver.go:250) - dropping trace reason: invalid span service:"ecto" name:"my_trace" traceID:7629694493567207627 spanID:6427408010887947325 start:1533152115901223000 du...
```
Am I missing some required params in my example?
Maybe this was intentional, but while I was working on tests for the functions in the Spandex module, I noticed that if you pass in a completion_time when you call Spandex.finish_span, it is ignored. If you call Spandex.update_span with that option just before calling Spandex.finish_span, then it is updated. We should think about whether it makes sense to warn/log/error when you pass in unsupported opts that are going to get ignored.
We want to make sure that Spandex can be used with large scale implementations, and to do so we need to ensure that sampling is implemented natively.
Currently the only things that use the configured defaults are the plugs, not the actual trace/span constructors. We should resolve that so that people using things like the traced decorator, or manually constructing traces, can take advantage of the defaults they set.
While testing distributed tracing, I noticed that traces were only getting sent when the threshold number of traces had been stored up. This means that if you have very low traffic for a period of time (i.e. the number of requests is less than batch_size), the traces won't get sent out.
I can understand that one solution to this is probably to tune that number to be lower (if not 1? eek), but one alternative I had in mind was some sort of timeout which triggered sending batches.
Keen on your thoughts around such an idea?
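To make the suggestion concrete, here is a minimal sketch of a batching sender that flushes when either the batch fills up or a flush interval elapses. All names (BatchSender, flush_fun, the option keys) are hypothetical; this is not the actual SpandexDatadog.ApiServer, just an illustration of the timeout idea:

```elixir
defmodule BatchSender do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def send_trace(trace), do: GenServer.cast(__MODULE__, {:trace, trace})

  @impl true
  def init(opts) do
    state = %{
      batch: [],
      batch_size: Keyword.get(opts, :batch_size, 10),
      interval: Keyword.get(opts, :flush_interval_ms, 1_000),
      flush_fun: Keyword.fetch!(opts, :flush_fun)
    }

    {:ok, schedule_flush(state)}
  end

  @impl true
  def handle_cast({:trace, trace}, %{batch: batch} = state) do
    batch = [trace | batch]

    # Flush immediately when the batch is full...
    if length(batch) >= state.batch_size do
      {:noreply, flush(%{state | batch: batch})}
    else
      {:noreply, %{state | batch: batch}}
    end
  end

  # ...and also whenever the timer fires, so low-traffic periods
  # still get their traces sent out.
  @impl true
  def handle_info(:flush, state), do: {:noreply, schedule_flush(flush(state))}

  defp flush(%{batch: []} = state), do: state

  defp flush(%{batch: batch, flush_fun: flush_fun} = state) do
    flush_fun.(Enum.reverse(batch))
    %{state | batch: []}
  end

  defp schedule_flush(state) do
    Process.send_after(self(), :flush, state.interval)
    state
  end
end
```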
Hi there,
I'm trying to test this library as a replacement for a custom monitoring setup that uses Elixometer/Exometer + Dogstatsd, but I'm confused about how to configure Datadog here.
Should I have the Datadog Agent running locally and set host to localhost and port to 8126, as the example in the docs does?
I have tried this and I keep getting:

```
[debug] Trace response: {:error, %HTTPoison.Error{id: nil, reason: :econnrefused}}
```
Based on the error it looks like the library is trying to do HTTP calls, so maybe it's trying to use Datadog's HTTP API instead of the Datadog Agent? If so, where should I configure my Datadog API key? And what should be the appropriate host and port?
Thanks in advance!
We recently added decorators to our project. Haven't actually gotten it working yet (overlapping traces from ecto or phx?) but that's not related to this issue so excuse me!
Decorators were causing test failures until we manually added Relay.Tracer.configure(disabled?: true) to test_helper.exs.
This is our config:

```elixir
config :relay, Relay.Tracer,
  service: :relay,
  adapter: SpandexDatadog.Adapter,
  type: :custom,
  disabled?: Mix.env() != :prod,
  env: release_level

config :spandex, :decorators, tracer: Relay.Tracer

config :spandex_phoenix,
  service: :phoenix,
  type: :web,
  tracer: Relay.Tracer

config :spandex_ecto, SpandexEcto.EctoLogger,
  service: :ecto,
  type: :db,
  tracer: Relay.Tracer
```
Ultimately, these get encoded back into a map and sent to the collector, so there's no reason to enforce that they be a keyword list (i.e. atom keys). For example, it would be nice to be able to include a map of key-value pairs that were decoded from a JSON payload without converting all the keys to atoms. Also in this scenario, it would be nice to be able to have nested maps as values. On a Datadog-specific note: I believe this is possible, but we'd have to traverse the maps and flatten the names into dot-syntax keys like parent.child.key: value.
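The flattening step described above can be sketched as a small recursive function over string-keyed maps (an illustration of the idea, not code from spandex):

```elixir
defmodule FlattenMeta do
  # Flatten nested maps into dot-syntax string keys, e.g.
  # %{"parent" => %{"child" => %{"key" => "value"}}} becomes
  # %{"parent.child.key" => "value"}.
  def flatten(map, prefix \\ nil) do
    Enum.reduce(map, %{}, fn {key, value}, acc ->
      key = if prefix, do: "#{prefix}.#{key}", else: to_string(key)

      case value do
        %{} = nested when map_size(nested) > 0 ->
          Map.merge(acc, flatten(nested, key))

        _ ->
          Map.put(acc, key, value)
      end
    end)
  end
end

FlattenMeta.flatten(%{"parent" => %{"child" => %{"key" => "value"}}, "top" => 1})
# => %{"parent.child.key" => "value", "top" => 1}
```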
Reproduced in v1.6.1 and the edge version from GitHub. Let's assume the following:

```elixir
# config/dev.exs
config :my_app, MyApp.Tracer,
  service: :my_app

# in the module I'm tracing
@tracer_opts [service: :etl, type: :custom, resource: "MyApp.SalesStatsETL"]
```

Example 1

```elixir
MyApp.Tracer.trace "trace_name", @tracer_opts do ... end
```

OUTCOME: in Datadog there's a trace with service="etl", which is expected 👍

Example 2

```elixir
MyApp.Tracer.trace "trace_name", @tracer_opts do
  MyApp.Tracer.update_span(
    sql_query: [query: "SELECT ..", db: "some_db", rows: "42"]
  )
end
```

OUTCOME: in Datadog there's a trace with service="my_app" 👎
EXPECTED OUTCOME: service should remain "etl", since it was not overridden explicitly.

Example 3

```elixir
MyApp.Tracer.trace "trace_name", @tracer_opts do
  %{service: service} = Core.Tracer.current_span()

  MyApp.Tracer.update_span(
    sql_query: [query: "SELECT ..", db: "some_db", rows: "42"],
    service: service
  )
end
```

OUTCOME: in Datadog there's a trace with service="etl" 👍
It looks like the environment variable name posted to Datadog is always set as "env", in the sense that when browsing through Datadog's APM screens it is possible to filter by "env". The point here is that I have many other applications posting the environment as "environment", thus I end up having two filters: env and environment. Is there a way to customize the "env" var name to "environment"? In the snippet below, if I try to rename env to environment, I get an error, since environment is not a recognized parameter:

```elixir
Tracer.configure(
  service: :booking,
  adapter: SpandexDatadog.Adapter,
  disabled?: tracing_disabled,
  env: "#{apm_tracing_env}"
)
```
Having the following lines in config.exs:

```elixir
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id, :trace_id, :span_id]
```

the trace is produced with no logs associated, even though the trace id and span id are included in the log message. It seems the format of the log message is wrong: according to the docs it should be dd.trace_id and dd.span_id for the log record to be correlated and visible on the trace page.
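If the docs indeed expect the dd.trace_id/dd.span_id key names, one way to experiment is to add those keys (as quoted atoms) to the metadata list and set them explicitly when a trace starts. A sketch only; the key names come from this issue, and setting them by hand is a workaround, not a documented Spandex feature:

```elixir
# config.exs
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id, :"dd.trace_id", :"dd.span_id"]
```

```elixir
# After starting/continuing a trace, copy the ids into Logger metadata
# under the names Datadog's log correlation expects:
Logger.metadata("dd.trace_id": trace_id, "dd.span_id": span_id)
```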
This might not be worth it, but would be more standard.
Note: initially I wanted to open this issue in spandex_datadog, but for reasons I'll describe in a second I'm opening it here.

Most Datadog tracing libraries support specifying the version of the service being traced, in order to track deployments. Usually, this is done with an environment variable like DD_VERSION, where the version is a string like "2.21" or "a7b91d". The feature allows us to track regressions in performance, monitor canary releases - neat stuff. We at Fresha would definitely use it, so I had the thought of implementing it in spandex_datadog - until I checked how others do it.

When I looked at other tracing standards like OpenTelemetry, it appears there's also a notion of a service.version in them. That's when I thought that it would actually make sense to specify the service version around the same time as the service name - either during spandex tracer configuration or when tracing.

My suggestion is to extend the tracer config and the options allowed in tracer functions so that we can pass a service version along with the service name. We could then propagate those in Span.t() until they reach spandex_datadog, where they would get converted to span metadata. If the idea makes sense, I'll be happy to implement it.
Currently, we break from the abstract at a high level, basically just defining the interface of a tracer. I'm getting started adding a Google Stackdriver tracing adapter, and what I'm realizing is that pretty much all of the logic stays the same. The only things that are different are the ApiServer and a few small details (like trace_ids/span_ids). So here is the proposition: we make most of the Datadog adapter code adapter-only code, and make Spandex.Datadog.Span just Spandex.Span. Then, we define an adapter in terms of the small things it needs to change about that process.

```elixir
defmodule Spandex.Adapters.Adapter do
  @moduledoc """
  This isn't all the callbacks, just a starting point.
  """
  @callback new_id() :: term
  @callback now() :: term
  @callback send_trace() :: term
end
```
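Under that proposal, a backend adapter would just implement those few callbacks. A hypothetical Datadog-flavored sketch (the 64-bit id range and millisecond timestamps are assumptions for illustration; a real module would also declare `@behaviour Spandex.Adapters.Adapter`):

```elixir
defmodule DatadogAdapter do
  # Datadog span/trace ids are unsigned 64-bit integers,
  # so generate a random positive integer in that range.
  def new_id, do: :rand.uniform(9_223_372_036_854_775_807)

  # Timestamps as system time in milliseconds.
  def now, do: :os.system_time(:milli_seconds)

  # Stub: a real adapter would hand the trace to its ApiServer here.
  def send_trace, do: :ok
end
```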
👋
I realise that the title might be a bit misleading and there is a chance that the problem lies elsewhere, but I hope you can assist me in this.
Environment:
```
Erlang/OTP 22 [erts-10.7] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1]
Elixir 1.9.4 (compiled with Erlang/OTP 22)
```

Relevant parts of my mix.exs file:

```elixir
{:spandex, "~> 2.4"},
{:spandex_datadog, "~> 0.4"},
{:spandex_phoenix, "~> 0.3"},
{:spandex_ecto, "~> 0.4"},
{:decorator, "~> 1.3"}
```
In short: I inherited a Phoenix app and I have to add some Datadog tracing. I've added everything required, following the docs, plus a small custom module to use in every module that I would like to be traced:

```elixir
defmodule MyService.Tracer do
  # All the required default tracer stuff here

  defmodule ModuleTracer do
    defmacro __using__(_opts) do
      quote do
        use Spandex.Decorators
        @decorate_all span()
      end
    end
  end
end
```

And the intended usage is:

```elixir
defmodule MyService.SomeModule do
  # I want to span everything happening in this module
  use MyService.Tracer.ModuleTracer
end
```

Now whenever I build my docker image I always get an error on the mix release step:

```
== Compilation error in file lib/my_service/some_module.ex ==
** (CompileError) lib/my_service/some_module.ex:1: module nil is not loaded and could not be found
    (stdlib) erl_eval.erl:680: :erl_eval.do_apply/6
    /lib/my_service/some_module.ex:1: Decorator.Decorate.before_compile/1
```
Some quick notes: the failing module is one that uses ModuleTracer. Am I trying to do something stupid here? This is the first time I am using the spandex-project, so maybe I am messing up something.
Would really appreciate your help, thank you :)