spandex-project / spandex
A platform agnostic tracing library
License: MIT License
Due to https://github.com/zachdaniel/spandex/blob/master/lib/datadog/span.ex#L116, if you set meta on any span with Tracer.update_span, all children will also receive the meta. Is this intended?
I think it's kind of confusing to have negative configuration options like disabled?: true rather than enabled: false. I also think it looks weird to have configuration options with question marks at the end, but I might be the one who's wrong on that one. It bothers me less when it's a function name.
I think we could make this change without breaking anything, so it doesn't need to wait for the 3.0.0 milestone.
In order to become more versatile, these things should be written as separate tools/libs. Having them in spandex causes people to need dependencies they would otherwise not need.
Our documentation is currently incorrect, as it claims that a traced/1 decorator exists that would allow for adding context to the parent span.
I have a use case where I'd like to include only bottom-level spans (leaf spans) in my trace, so that starting from some distinguished span, only the spans that do not have children are forwarded. Is it possible to achieve that without forking this library and implementing such a feature?
We should follow modern patterns for configuration, and use a pattern whereby configuration is scoped to a module, such that there could be multiple, differently configured tracers.
Currently metadata is only allowed in update_span arguments. For simplicity, it would be nice if we could pass metadata arguments to Tracer.span along with its block, like:

```elixir
Tracer.span("span_me_also", resource: "aaa", sql_query: [query: "..."]) do
  ...
end
```
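Until something like that exists, a workaround under the current API, where metadata is only accepted by update_span, is to call update_span as the first expression in the block (a sketch using the option names from the example above):

```elixir
Tracer.span("span_me_also") do
  Tracer.update_span(resource: "aaa", sql_query: [query: "..."])
  ...
end
```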
While running blockscout's tests I get a lot of warnings:

```
warning: the Collectable protocol is deprecated for non-empty lists. The behaviour of things like Enum.into/2 or "for" comprehensions with an :into option is incorrect when collecting into non-empty lists. If you're collecting into a non-empty keyword list, consider using Keyword.merge/2 instead. If you're collecting into a non-empty list, consider concatenating the two lists with the ++ operator.
  (elixir) lib/collectable.ex:83: Collectable.List.into/1
  (msgpax) lib/msgpax/packer.ex:151: Msgpax.Packer.List.pack/1
  (msgpax) lib/msgpax/packer.ex:152: anonymous fn/3 in Msgpax.Packer.List.pack/1
  (elixir) lib/enum.ex:1940: Enum."-reduce/3-lists^foldl/2-0-"/3
  (msgpax) lib/msgpax/packer.ex:151: Msgpax.Packer.List.pack/1
  (msgpax) lib/msgpax.ex:85: Msgpax.pack/2
  (msgpax) lib/msgpax.ex:122: Msgpax.pack!/2
  (spandex_datadog) lib/spandex_datadog/api_server.ex:192: SpandexDatadog.ApiServer.send_and_log/2
  (spandex_datadog) lib/spandex_datadog/api_server.ex:169: anonymous fn/3 in SpandexDatadog.ApiServer.handle_call/3
  (elixir) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
  (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
```
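The deprecation warning above is about collecting into a non-empty list. A small standalone illustration of the Keyword.merge/2 replacement it suggests (unrelated to the msgpax internals, just showing the pattern):

```elixir
defaults = [host: "localhost", port: 8126]
overrides = [port: 9000]

# Deprecated on non-empty lists:
#   Enum.into(overrides, defaults)

# Suggested replacement; keys in the second list win:
Keyword.merge(defaults, overrides)
# => [host: "localhost", port: 9000]
```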
Travis CI seems impossible to deal with. I'm having all sorts of issues provisioning it, or even getting it to show up under this organization. There is also like a legacy travis CI account that I have, that might run CI on this org? I can't seem to install it in this organization using the marketplace.
Spandex should expose structs for representing traces/spans, and then each adapter uses this structure. We would probably have a private field for adapter-specific fields.
Currently traces are published synchronously by the process doing the tracing, which is definitely not scalable.
Is there a possibility to add Datadog's tags to traces? So that in addition to the metadata on the span, we tag the span, e.g. with result:success.
Based on a comment from the DataDog slack:

> right, we have a special API to report the list of services with their metadata (this is an internal API which might disappear at some point), it would still work if you don't implement it
> that is where we assign to services an "app" and an "app_type"; only used in the UI to represent a service (the icon in the service list / the text when hovering)

So I would suggest removing the part where we create services for DD and sticking to sending spans only.
In the docs it just says "default service name" which I don't understand.
What does this setting actually do?
spandex-project/spandex_datadog#3 illustrates how we're just using a raw map to communicate data from the adapter to spandex. Instead, we should have distributed trace contexts hydrate a struct provided in core.
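A minimal sketch of what such a core struct might look like (the field names here are assumptions for illustration, not necessarily the actual Spandex.SpanContext definition):

```elixir
defmodule SpanContext do
  # Hypothetical core struct that adapters would hydrate
  # instead of passing raw maps between the adapter and spandex.
  defstruct trace_id: nil, parent_id: nil, priority: 1, baggage: []
end

# An adapter would decode incoming headers and return this struct:
ctx = %SpanContext{trace_id: 123, parent_id: 456}
```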
Hey!
The documented way of tracing Ecto was lost somewhere between v1.3 and v1.6. However, googling for "Spandex Ecto" still takes you to that outdated page. Would it be possible to describe, or link to, the proper way of tracing Ecto right in the main README?
Thanks for all your great work!
Right now, the library does everything it can not to fail any operations whatsoever, at all. This can mean that updates to spans might fail if they are not valid, or that entire traces aren't started. This also happens silently, currently. We want some kind of strict mode that fails on any issue with any operation, or perhaps a log_errors configuration.
We don't have a valid use case for each tracer having its own storage key, and in fact this makes using the library much harder than it should be.
See the Plug docs for details, but libraries aren't supposed to pollute the assigns on the Conn.
https://github.com/elixir-plug/plug/blob/master/lib/plug/conn.ex#L69
https://github.com/elixir-plug/plug/blob/master/lib/plug/conn.ex#L84-L89
We should separate the datadog adapter into a separate repository.
We currently use the process dictionary, which is incompatible with certain designs. We should make both available as options.
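A sketch of how a pluggable storage strategy could look, with the process dictionary as one implementation. The module and callback names here are illustrative, not the actual Spandex strategy API:

```elixir
defmodule TraceStorage do
  # Illustrative behaviour for pluggable trace storage; a tracer would
  # be configured with one implementation of this behaviour.
  @callback get_trace(key :: atom) :: term
  @callback put_trace(key :: atom, trace :: term) :: :ok
end

defmodule TraceStorage.Pdict do
  @behaviour TraceStorage

  # Process-dictionary-backed storage, i.e. the current default approach.
  # An ETS- or GenServer-backed module could implement the same callbacks.
  @impl true
  def get_trace(key), do: Process.get({__MODULE__, key})

  @impl true
  def put_trace(key, trace) do
    Process.put({__MODULE__, key}, trace)
    :ok
  end
end
```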
The first questions I had when I opened this project were "What does it do? What is it good for?" ;)
Hi!
First off, thanks for making this. I really appreciate it and I enjoyed your talk at the BEAM conference. We use an umbrella app and most of spandex is working; however, I noticed this line:
https://github.com/spandex-project/spandex/blob/master/lib/decorators.ex#L27
I am trying to figure out which OTP app would be chosen here in Application.get_env during compile time. I've moved the Datadog.ApiServer and my Tracer into a common application that is shared across my apps in the umbrella app. Currently, I have the decorator set in every config.exs for each app, but I was wondering if this is overkill, or if you had any idea which OTP app is the one used at compilation time for the decorators.
I have the correct config for my logger:

```elixir
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id, :viewer_id, :trace_id, :span_id]
```

But the trace_id is not being logged:

```
Oct 23 15:51:51 dc1-live-appserver2 eggl[30255]: 15:51:51.298 request_id=iahoj1li85bhul2kf16m54lqsd11000p viewer_id=788210 span_id=4174988920911713992 [info] QUERY OK source="valuations" db=1.5ms queue=0.1ms
```
Hey, I've spent the day trying to make this work. I looked at the example (which is outdated, but I found a couple of places with newer documentation and adapted). I don't see any logs or exceptions, even when I enable verbose? mode. Is there any place I can go to receive assistance? I've even tried to debug the code, to no avail :(
current_context should return a consistent value in all cases, one that can be passed to continue_trace. With that value, continue_trace should be able to determine whether it should start a trace at all and, if so, whether that trace should be a continuation of the previous context or a new trace.
Hello,
Sorry for the generalness of this question. I'm having some trouble understanding how traces work in the context of many child processes. I understand that I can use Spandex.get_current_span and Spandex.continue_trace to copy the relevant data into the child process's dictionary, but I'm not sure if the continued trace should have a new name. My high-level problem is that we're running a large GraphQL API (Absinthe) where some of the field resolvers run asynchronously via wrapper functions and some do not, and we'd like to use Spandex to make a flame graph where each span represents one resolver. Do you have any suggestions or thoughts about this? Does it make sense to add functions that just copy the process dictionary into a child process without requiring a name?
Currently there are two methods that Spandex provides to do distributed tracing.

inject_context:

```elixir
@spec inject_context(headers(), SpanContext.t(), Tracer.opts()) :: headers()
```

distributed_context:

```elixir
@spec distributed_context(Plug.Conn.t(), Tracer.opts()) ::
        {:ok, SpanContext.t()}
        | {:error, :disabled}
```

Given that these two methods are intended to be counterparts of each other, would it make more sense if the distributed_context interface was actually:

```elixir
@spec distributed_context(headers(), Tracer.opts()) ::
        {:ok, SpanContext.t()}
        | {:error, :disabled}
```

An example use case would be injecting and extracting tracing contexts from a gRPC request, which does not rely on Plug.Conn. Thoughts?
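For concreteness, a rough sketch of what a headers-based extraction could look like, assuming Datadog-style header names and a plain map in place of SpanContext.t(). This is a hypothetical illustration of the proposal, not the actual Spandex implementation:

```elixir
defmodule HeaderContext do
  # Hypothetical headers-based distributed_context/2: takes a list of
  # {name, value} header tuples instead of a Plug.Conn.
  def distributed_context(headers, _opts \\ []) do
    headers = Map.new(headers, fn {k, v} -> {String.downcase(k), v} end)

    with {:ok, trace_id} <- fetch_int(headers, "x-datadog-trace-id"),
         {:ok, parent_id} <- fetch_int(headers, "x-datadog-parent-id") do
      {:ok, %{trace_id: trace_id, parent_id: parent_id}}
    else
      _ -> {:error, :no_distributed_trace}
    end
  end

  defp fetch_int(headers, key) do
    with {:ok, value} <- Map.fetch(headers, key),
         {int, ""} <- Integer.parse(value) do
      {:ok, int}
    else
      _ -> :error
    end
  end
end
```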
Hi,
I'm working with this library but I can't get information sent to Datadog. Please tell me if there is any other diagnostic information I can provide or troubleshooting steps I've missed. I've set up the application as the README suggests, and then, when my application is running on the same host as the Datadog agent, I run the following commands in iex:

```elixir
Tracer.start_trace("foo")
Tracer.update_span(service: :my_service, type: :web, resource: "/bar")
Tracer.finish_span()
Tracer.finish_trace()
```
What else I've tried
Source Files:
Thank you
What is the proper way to pass a trace between Elixir processes? I don't see public methods in the Tracer interface for obtaining the current trace, and the Spandex module uses strategy.get_trace for start_span/update_span. Either I'm missing it, or maybe you have plans to implement it in the future.
Right now, a lot of errors are ignored via usage of update_or_keep/2, which either updates the span and returns the new span, or doesn't update the span at all but returns the old span. This made some sense as a way to have consistent return types, but we're moving towards having more idiomatic return types, and that includes not swallowing errors. More discussion can be found in #63, where the issue was originally discovered. Additionally, that PR adds tests that can be un-skipped when the update functions have these new predictable return types.
I am a bit at a loss where to start debugging this, so this is a bit of a broad ask for help. It seems that when we put some load on the Spandex GenServer, it crashes from time to time, causing spans to fail hundreds of thousands of times with messages like:

```
GenServer.call(#PID<0.934.1>, {:update, #Function<18.39421655/1 in SpandexDatadog.ApiServer.handle_call/3>}, 5000)
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

GenServer terminating:
exited in: GenServer.call(SpandexDatadog.ApiServer, {:send_trace, ... 30000)
** (EXIT) exited in: Task.await(%Task{owner: #PID<0.2917.0>, pid: #PID<0.9250.0>, ref: #Reference<0.698663207.2437152781.202107>}, 5000)
** (EXIT) time out
```

Now I cannot pinpoint a specific reason the Spandex GenServer itself crashes, but perhaps the timeout could simply be hit when there is a queue or something? Any help or directions on how to debug this would be greatly appreciated.
I ran a profiler on my app using eprof and fprof to identify bottlenecks, tracing the execution of all functions in the code and reporting the time consumed by each. Looking at the results, 25% of the time was spent in Optimal, which is the lib responsible for validating options. So I set up a benchmark script using Benchee to compare the original code, with the Optimal dependency, against a version that overrides it with a simpler implementation. The result was about 10% faster over the whole request without the Optimal dependency.

https://hexdocs.pm/mix/1.8.1/Mix.Tasks.Profile.Eprof.html
https://hexdocs.pm/mix/1.8.1/Mix.Tasks.Profile.Fprof.html
```
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
Number of Available Cores: 3
Available memory: 5.82 GB
Elixir 1.8.1
Erlang 21.3.7

Benchmark suite executing with the following configuration:
warmup: 0 ns
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 15 s

Benchmarking 0_warmup...
Benchmarking 1_normal...
Benchmarking 2_optimal_overridden...
warning: redefining module Optimal (current version loaded from _build/test/lib/optimal/ebin/Elixir.Optimal.beam)
  priv/myapp_web/script_optimal.exs:23

Name                           ips        average  deviation         median         99th %
2_optimal_overridden        270.69        3.69 ms    ±33.85%        3.26 ms        7.91 ms
1_normal                    248.06        4.03 ms    ±35.16%        3.58 ms        9.72 ms
0_warmup                    129.54        7.72 ms  ±1258.76%        3.45 ms       11.21 ms

Comparison:
2_optimal_overridden        270.69
1_normal                    248.06 - 1.09x slower +0.34 ms
0_warmup                    129.54 - 2.09x slower +4.03 ms
```
Spandex needs to be able to toggle logging. The logger can be a big bottleneck in high traffic applications, and logging in erroneous circumstances should be opt in.
Hey! Phoenix/Plug works fine for me, but I cannot do custom tracing:

```elixir
defmodule Core.Tracer do
  @moduledoc """
  DataDog tracer
  """
  use Spandex.Tracer, otp_app: :dice
end

defmodule ManuallyTraced do
  alias Core.Tracer
  require Core.Tracer

  # Does not handle exceptions for you.
  def trace_me() do
    _ = Tracer.start_trace("my_trace") # also opens a span
    _ = Tracer.update_span(service: :ecto, type: :db, sql_query: [query: "SELECT * FROM posts", rows: "10"])
    :timer.sleep(1000)
    _ = Tracer.finish_trace()
  end
end

ManuallyTraced.trace_me()
```

The Datadog Agent rejects this trace:

```
[ TRACE ] 2018-08-01 19:35:16 ERROR (receiver.go:250) - dropping trace reason: invalid span service:"ecto" name:"my_trace" traceID:7629694493567207627 spanID:6427408010887947325 start:1533152115901223000 du...
```
Am I missing some required params in my example?
Maybe this was intentional, but while I was working on tests for the functions in the Spandex module, I noticed that if you pass in a completion_time when you call Spandex.finish_span, it is ignored. If you call Spandex.update_span with that option just before calling Spandex.finish_span, then it is updated. We should think about whether it makes sense to warn/log/error when you pass in unsupported opts that are going to get ignored.
We want to make sure that Spandex can be used with large scale implementations, and to do so we need to ensure that sampling is implemented natively.
Currently the only things that use the configured defaults are the plugs, not the actual trace/span constructors. We should resolve that so that people using things like the traced decorator, or manually constructing traces, can take advantage of the defaults they set.
While testing distributed tracing, I noticed that traces were only getting sent when the threshold number of traces had been stored up. This means that if you have very low traffic for a period of time (i.e. the number of requests is less than batch_size), the traces won't get sent out.
I can understand that one solution to this is probably to tune that number to be lower (if not 1? eek), but one alternative I had in mind was some sort of timeout which triggered sending batches.
Keen on your thoughts around such an idea?
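To make the suggestion concrete, here is a minimal sketch of a batching sender that flushes when either the batch fills up or a flush interval elapses. All names (BatchSender, flush_fun, the option keys) are hypothetical; this is not the actual SpandexDatadog.ApiServer, just an illustration of the timeout idea:

```elixir
defmodule BatchSender do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def send_trace(trace), do: GenServer.cast(__MODULE__, {:trace, trace})

  @impl true
  def init(opts) do
    state = %{
      batch: [],
      batch_size: Keyword.get(opts, :batch_size, 10),
      interval: Keyword.get(opts, :flush_interval_ms, 1_000),
      flush_fun: Keyword.fetch!(opts, :flush_fun)
    }

    {:ok, schedule_flush(state)}
  end

  @impl true
  def handle_cast({:trace, trace}, %{batch: batch} = state) do
    batch = [trace | batch]

    # Flush immediately when the batch is full...
    if length(batch) >= state.batch_size do
      {:noreply, flush(%{state | batch: batch})}
    else
      {:noreply, %{state | batch: batch}}
    end
  end

  # ...and also whenever the timer fires, so low-traffic periods
  # still get their traces sent out.
  @impl true
  def handle_info(:flush, state), do: {:noreply, schedule_flush(flush(state))}

  defp flush(%{batch: []} = state), do: state

  defp flush(%{batch: batch, flush_fun: flush_fun} = state) do
    flush_fun.(Enum.reverse(batch))
    %{state | batch: []}
  end

  defp schedule_flush(state) do
    Process.send_after(self(), :flush, state.interval)
    state
  end
end
```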
Hi there,
I'm trying to test this library as a replacement for a custom monitoring setup that uses Elixometer/Exometer + Dogstatsd, but I'm confused about how to configure Datadog here.
Should I have the Datadog Agent running locally and set host to localhost and port to 8126, as the example in the docs does?
I have tried this and I keep getting:

```
[debug] Trace response: {:error, %HTTPoison.Error{id: nil, reason: :econnrefused}}
```
Based on the error it looks like the library is trying to do HTTP calls, so maybe it's trying to use Datadog's HTTP API instead of the Datadog Agent? If so, where should I configure my Datadog API key? And what should be the appropriate host and port?
Thanks in advance!
We recently added decorators to our project. Haven't actually gotten it working yet (overlapping traces from ecto or phx?) but that's not related to this issue so excuse me!
Decorators were causing test failures until we manually added Relay.Tracer.configure(disabled?: true) to test_helper.exs.
This is our config:

```elixir
config :relay, Relay.Tracer,
  service: :relay,
  adapter: SpandexDatadog.Adapter,
  type: :custom,
  disabled?: Mix.env() != :prod,
  env: release_level

config :spandex, :decorators, tracer: Relay.Tracer

config :spandex_phoenix,
  service: :phoenix,
  type: :web,
  tracer: Relay.Tracer

config :spandex_ecto, SpandexEcto.EctoLogger,
  service: :ecto,
  type: :db,
  tracer: Relay.Tracer
```
Ultimately, these get encoded back into a map and sent to the collector, so there's no reason to enforce that they be a keyword list (i.e. atom keys). For example, it would be nice to be able to include a map of key-value pairs that were decoded from a JSON payload without converting all the keys to atoms. Also in this scenario, it would be nice to be able to have nested maps as values. On a Datadog-specific note: I believe this is possible, but we'd have to traverse the maps and flatten the names into dot-syntax keys like parent.child.key: value.
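The flattening step described above can be sketched as a small recursive function over string-keyed maps (an illustration of the idea, not code from spandex):

```elixir
defmodule FlattenMeta do
  # Flatten nested maps into dot-syntax string keys, e.g.
  # %{"parent" => %{"child" => %{"key" => "value"}}} becomes
  # %{"parent.child.key" => "value"}.
  def flatten(map, prefix \\ nil) do
    Enum.reduce(map, %{}, fn {key, value}, acc ->
      key = if prefix, do: "#{prefix}.#{key}", else: to_string(key)

      case value do
        %{} = nested when map_size(nested) > 0 ->
          Map.merge(acc, flatten(nested, key))

        _ ->
          Map.put(acc, key, value)
      end
    end)
  end
end

FlattenMeta.flatten(%{"parent" => %{"child" => %{"key" => "value"}}, "top" => 1})
# => %{"parent.child.key" => "value", "top" => 1}
```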
Reproduced in v1.6.1 and the edge version from GitHub. Let's assume the following:

```elixir
# config/dev.exs
config :my_app, MyApp.Tracer,
  service: :my_app

# in the module I'm tracing
@tracer_opts [service: :etl, type: :custom, resource: "MyApp.SalesStatsETL"]
```

Example 1

```elixir
MyApp.Tracer.trace "trace_name", @tracer_opts do ... end
```

OUTCOME: in Datadog there's a trace with service="etl", which is expected 👍

Example 2

```elixir
MyApp.Tracer.trace "trace_name", @tracer_opts do
  MyApp.Tracer.update_span(
    sql_query: [query: "SELECT ..", db: "some_db", rows: "42"]
  )
end
```

OUTCOME: in Datadog there's a trace with service="my_app" 👎
EXPECTED OUTCOME: service should remain "etl", since it was not overridden explicitly.

Example 3

```elixir
MyApp.Tracer.trace "trace_name", @tracer_opts do
  %{service: service} = Core.Tracer.current_span()

  MyApp.Tracer.update_span(
    sql_query: [query: "SELECT ..", db: "some_db", rows: "42"],
    service: service
  )
end
```

OUTCOME: in Datadog there's a trace with service="etl" 👍
It looks like the environment variable name posted to Datadog is always set as "env", in the sense that when browsing through Datadog's APM screens it is possible to filter by "env". The point here is that I have many other applications posting the environment as "environment", thus I end up having two filters: env and environment. Is there a way to customize the "env" var name to "environment"? In the snippet below, if I try to rename env to environment, I get an error, since environment is not a recognized parameter:

```elixir
Tracer.configure(
  service: :booking,
  adapter: SpandexDatadog.Adapter,
  disabled?: tracing_disabled,
  env: "#{apm_tracing_env}"
)
```
Having the following lines in config.exs:

```elixir
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id, :trace_id, :span_id]
```

the trace is produced with no logs associated, even though the trace id and span id are included in the log message. It seems the format of the log message is wrong: according to the docs it should be dd.trace_id and dd.span_id for the log record to be correlated and visible on the trace page.
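If the docs indeed expect the dd.trace_id/dd.span_id key names, one way to experiment is to add those keys (as quoted atoms) to the metadata list and set them explicitly when a trace starts. A sketch only; the key names come from this issue, and setting them by hand is a workaround, not a documented Spandex feature:

```elixir
# config.exs
config :logger, :console,
  format: "$time $metadata[$level] $message\n",
  metadata: [:request_id, :"dd.trace_id", :"dd.span_id"]
```

```elixir
# After starting/continuing a trace, copy the ids into Logger metadata
# under the names Datadog's log correlation expects:
Logger.metadata("dd.trace_id": trace_id, "dd.span_id": span_id)
```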
This might not be worth it, but would be more standard.
Note: initially I wanted to open this issue in spandex_datadog, but for reasons I'll describe in a second I'm opening it here.

Most Datadog tracing libraries support specifying the version of the service being traced, in order to track deployments. Usually, this is done with an environment variable like DD_VERSION, where the version is a string like "2.21" or "a7b91d". The feature allows us to track regressions in performance, monitor canary releases - neat stuff. We at Fresha would definitely use it, so I had the thought of implementing it in spandex_datadog - until I checked how others do it.

When I looked at other tracing standards like OpenTelemetry, it appears there's also a notion of a service.version in them. That's when I thought that it would actually make sense to specify the service version around the same time as the service name - either during spandex tracer configuration or when tracing.

My suggestion is to extend the tracer config and the options allowed in tracer functions so that we can pass a service version along with the service name. We could then propagate those in Span.t() until they reach spandex_datadog, where they would get converted to span metadata. If the idea makes sense, I'll be happy to implement it.
Currently, we break from the abstract at a high level, basically just defining the interface of a tracer. I'm getting started adding a Google Stackdriver tracing adapter, and what I'm realizing is that pretty much all of the logic stays the same. The only things that are different are the ApiServer and a few small details (like trace_ids/span_ids). So here is the proposition: we make most of the Datadog adapter code adapter-only code, and make Spandex.Datadog.Span just Spandex.Span. Then, we define an adapter in terms of the small things it needs to change about that process.

```elixir
defmodule Spandex.Adapters.Adapter do
  @moduledoc """
  This isn't all the callbacks, just a starting point.
  """
  @callback new_id() :: term
  @callback now() :: term
  @callback send_trace() :: term
end
```
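Under that proposal, a backend adapter would just implement those few callbacks. A hypothetical Datadog-flavored sketch (the 64-bit id range and millisecond timestamps are assumptions for illustration; a real module would also declare `@behaviour Spandex.Adapters.Adapter`):

```elixir
defmodule DatadogAdapter do
  # Datadog span/trace ids are unsigned 64-bit integers,
  # so generate a random positive integer in that range.
  def new_id, do: :rand.uniform(9_223_372_036_854_775_807)

  # Timestamps as system time in milliseconds.
  def now, do: :os.system_time(:milli_seconds)

  # Stub: a real adapter would hand the trace to its ApiServer here.
  def send_trace, do: :ok
end
```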
👋
I realise that the title might be a bit misleading and there is a chance that the problem lies elsewhere, but I hope you can assist me in this.
Environment:
```
Erlang/OTP 22 [erts-10.7] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1]
Elixir 1.9.4 (compiled with Erlang/OTP 22)
```

Relevant parts of my mix.exs file:

```elixir
{:spandex, "~> 2.4"},
{:spandex_datadog, "~> 0.4"},
{:spandex_phoenix, "~> 0.3"},
{:spandex_ecto, "~> 0.4"},
{:decorator, "~> 1.3"}
```
In short: I inherited a Phoenix app and I have to add some Datadog tracing. I've added everything required, following the docs, plus a small custom module to use in every module that I would like to be traced:

```elixir
defmodule MyService.Tracer do
  # All the required default tracer stuff here

  defmodule ModuleTracer do
    defmacro __using__(_opts) do
      quote do
        use Spandex.Decorators
        @decorate_all span()
      end
    end
  end
end
```

And the intended usage is:

```elixir
defmodule MyService.SomeModule do
  # I want to span everything happening in this module
  use MyService.Tracer.ModuleTracer
end
```

Now whenever I build my docker image I always get an error on the mix release step:

```
== Compilation error in file lib/my_service/some_module.ex ==
** (CompileError) lib/my_service/some_module.ex:1: module nil is not loaded and could not be found
    (stdlib) erl_eval.erl:680: :erl_eval.do_apply/6
    /lib/my_service/some_module.ex:1: Decorator.Decorate.before_compile/1
```
Some quick notes: the failing module is one that uses ModuleTracer. Am I trying to do something stupid here? This is the first time I am using the spandex-project, so maybe I am messing up something.
Would really appreciate your help, thank you :)