
Comments (13)

rfratto commented on August 19, 2024

10 GB definitely seems like a lot. I suspect this might be caused by Go >= 1.12's use of MADV_FREE for memory management, which leads to a higher reported RSS even though a portion of that memory can be reclaimed by the kernel whenever it needs it. You can revert to the 1.11 behavior by setting the GODEBUG environment variable to madvdontneed=1, which gives a more accurate RSS reading.

Checking the go_memstats_heap_inuse_bytes metric from the Agent will give a clearer picture of how much memory is actively being used. Dividing that by agent_wal_storage_active_series will also give you the average bytes per series; we tend to see around 9KB across our 20 agents.

If go_memstats_heap_inuse_bytes is still unexpectedly high, it would be useful to share a heap profile. You can generate one by running wget against /debug/pprof/heap on the Agent's HTTP server.
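If it's easier to script, here's a minimal Go sketch that does the same thing as wget (the localhost:12345 address is an assumption; substitute whatever HTTP listen address your Agent uses):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Hypothetical address: replace with the Agent's actual HTTP listen address/port.
	resp, err := http.Get("http://localhost:12345/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Save the profile; inspect it later with `go tool pprof heap.pprof`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```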


rfratto commented on August 19, 2024

In my testing, most of the memory usage tends to boil down to:

a) how many series exist in a single scraped target
b) the length of label names and values
c) the number of active series

If you have one target with a significant number of series, the memory allocated for scraping that target sticks around in a pool even when the target isn't being actively scraped.

Likewise, even though strings are interned, having mostly long label names and values will negatively influence memory usage, as will a large number of active series.
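To put a rough number on (b), here's a toy sketch (a simplified Label type, not Prometheus's labels.Labels, and it ignores interning and struct overhead) that estimates the raw label bytes a single series carries:

```go
package main

import "fmt"

// Label is a simplified stand-in for a Prometheus label pair.
type Label struct {
	Name, Value string
}

// approxLabelBytes sums the raw byte length of every label name and value.
// It ignores interning and struct overhead, so treat it as a lower bound.
func approxLabelBytes(lbls []Label) int {
	n := 0
	for _, l := range lbls {
		n += len(l.Name) + len(l.Value)
	}
	return n
}

func main() {
	// Hypothetical series: long pod names and bucket labels add up quickly.
	series := []Label{
		{"__name__", "http_request_duration_seconds_bucket"},
		{"namespace", "production"},
		{"pod", "checkout-service-6c9f8d7b9d-x2k4l"},
		{"le", "0.25"},
	}
	fmt.Printf("~%d bytes of label data for this series\n", approxLabelBytes(series))
}
```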


yywandb commented on August 19, 2024

Thanks for your help @rfratto!

We tried setting the GODEBUG env variable as you suggested, but the same issue persists (memory spikes and then the pod OOMs, since we have a memory limit set). Does this indicate that it's not a memory-management issue?

This is what the size per series looks like (I added the env variable at 16:54): [screenshot]

Looking at the heap profile from an agent with the issue and an agent without:

remote-write-agent3 (with the memory spiking problem): [heap profile screenshot]

remote-write-agent0 (no problems): [heap profile screenshot]

Here are the heap dumps:
heapdump.zip

Does anything look suspicious to you in those heapdumps?


yywandb commented on August 19, 2024

Update:
go_memstats_heap_inuse_bytes metrics: [screenshot]

container_memory_rss using cadvisor metrics: [screenshot]

After updating the agents with the env variable at 16:54, remote-write-agent-3 had a spike in memory usage and was OOMKilled at 17:07. Its memory usage seems stable now, but we are beginning to see the memory usage of remote-write-agent-2 creep up.

Do you think this indicates that there's a scrape target with large metrics (e.g. long labels) that has moved between the agents? We shard the scrape targets based on pod name, so it's possible that a pod name changed, causing the target, now under a different pod name, to be scraped by a different agent.
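For what it's worth, here's a rough sketch of why a pod rename tends to move a target to a different agent under hashmod-style sharding (simplified; the md5-modulus step mirrors what a hashmod relabel rule does, and the shard count of 4 plus the pod names are made-up examples):

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// shardFor hashes the pod name and reduces it modulo the number of agents,
// mimicking a hashmod-style relabel rule that shards scrape targets.
func shardFor(podName string, shards uint64) uint64 {
	sum := md5.Sum([]byte(podName))
	// Use the low 8 bytes of the digest as a uint64, as hashmod effectively does.
	return binary.BigEndian.Uint64(sum[8:]) % shards
}

func main() {
	const shards = 4 // example: 4 remote-write agents
	// A pod rename (e.g. a new ReplicaSet hash) usually lands on a different shard.
	for _, pod := range []string{"big-exporter-7d9f4c-abcde", "big-exporter-5b8c2d-fghij"} {
		fmt.Printf("%s -> remote-write-agent-%d\n", pod, shardFor(pod, shards))
	}
}
```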


rfratto commented on August 19, 2024

Thanks for providing the heap dumps! Unfortunately nothing seems out of the ordinary; most of the memory usage comes from building labels during relabeling (which I assume is because hashmod is applied to all the series).

From what I can tell, I agree that this seems like there's just a giant target that's moving between agents and pushing your pod beyond its memory limits. How many active series were on remote-write-agent-3 before and after the memory usage moved to -2?

I've reached out to my team to see if we can get a second pair of eyes on this just in case I'm missing something here.


yywandb commented on August 19, 2024

Yeah, the number of active series seems like the prime suspect here.

Number of active series (same pattern as memory usage): [screenshot]

Number of samples (seems to be around the same across agents): [screenshot]

It seems like there's probably a target with a high number of unique metric series that's moving between the agents.

Do you know the best way to analyze which targets have the highest number of active series? I saw this blog post about analyzing cardinality and churn for Prometheus, and I'm wondering if we can do something similar for the remote write agent.


rfratto commented on August 19, 2024

That information isn't exposed in the Agent yet, unfortunately. Prometheus can expose information about targets and series per target in its scraper, but I'm not using that yet (I had planned to expose it in an API that I vaguely described in #6).

Short term, the easiest way for you to find out the problematic target might be to hack on the Agent to print out that metadata. Off the top of my head, I'd suggest adding a goroutine in the instance code that polls the scrape manager and logs out targets with a high number of series. You could then build a new image by running RELEASE_BUILD=1 make agent-image.
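Something along these lines might work as a starting point (a sketch only, not code that exists in the Agent: scrape.Manager.TargetsActive() is a real Prometheus API, but seriesCount is a hypothetical helper you would have to plumb out of the scrape loop yourself):

```go
package instance

import (
	"log"
	"time"

	"github.com/prometheus/prometheus/scrape"
)

// seriesCount is hypothetical: scrape.Target doesn't expose a per-target
// series count, so this value would need to be plumbed out of the scrape loop.
func seriesCount(t *scrape.Target) int { return 0 /* placeholder */ }

// logLargeTargets polls the scrape manager once a minute and logs any target
// whose series count exceeds threshold. Sketch only; not wired into the Agent.
func logLargeTargets(mgr *scrape.Manager, threshold int) {
	for range time.Tick(time.Minute) {
		for job, targets := range mgr.TargetsActive() {
			for _, t := range targets {
				if n := seriesCount(t); n > threshold {
					log.Printf("target with high series count: job=%s target=%s series=%d",
						job, t.Labels().String(), n)
				}
			}
		}
	}
}
```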


yuecong commented on August 19, 2024

Thanks @rfratto, I'll give that a try. BTW, I'm curious why we still need to keep active queries if it's purely remote write. Is this a Prometheus limitation? :)


rfratto commented on August 19, 2024

Hi @yuecong, I'm not sure what you mean by needing to keep active queries; could you explain a little more?


yuecong commented on August 19, 2024

Say one agent scrapes M metrics across all of its targets; let's say that's about 15K. Some of those metrics have high cardinality, with label values that change from scrape to scrape, which makes the number of active series (what I meant by active queries) much higher than the number of metrics scraped in any single pass. I think this is likely what is happening in our system.
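As a toy illustration of that effect (hypothetical metric and label names, nothing from our actual workload):

```go
// A series is identified by its full label set, so a label whose value changes
// on every scrape creates a brand new series each time, even though only one
// metric is exposed.
package main

import "fmt"

func main() {
	seen := map[string]struct{}{} // distinct series observed so far

	for scrape := 0; scrape < 5; scrape++ {
		// e.g. a "request_id"-style label that is different on every scrape
		series := fmt.Sprintf(`http_requests_total{pod="api-0",request_id="req-%d"}`, scrape)
		seen[series] = struct{}{}
	}

	// 1 scraped metric, but 5 active series after 5 scrapes
	fmt.Printf("metrics scraped per pass: 1, distinct series: %d\n", len(seen))
}
```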

So I'm wondering whether the agent could avoid caring about the cardinality of each metric entirely and just call the remote write API to the storage backend, so that the agent doesn't suffer from the high cardinality. I agree the storage backend will still suffer from it, for sure. :)

Hope this is more clear.


cstyan commented on August 19, 2024

@yuecong Because of the way Prometheus' WAL and remote write system are designed (the agent is built on top of them), there's no way to avoid caring about active series.

The WAL has several record types, but the ones that matter for remote write are series records and sample records. Series records are written when a new series starts (or in checkpoints if a series is still active) and contain a reference ID plus the labels for that series. Sample records contain only the reference ID and the current timestamp/value for that series. For remote write to be more reliable by reusing the WAL, the remote write system has to cache the data it reads from series records, so series churn over short periods of time leads to increased memory usage with remote write.
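To make that concrete, here's a heavily simplified sketch of the two record shapes and the label cache remote write has to maintain (illustrative types only, not the real definitions in Prometheus's tsdb/record package):

```go
package main

import "fmt"

// Simplified stand-ins for the WAL record types (the real ones live in
// Prometheus's tsdb/record package and carry more detail).
type seriesRecord struct {
	Ref    uint64            // reference ID for the series
	Labels map[string]string // full label set, written once when the series starts
}

type sampleRecord struct {
	Ref uint64  // points back at a series record
	T   int64   // timestamp (ms)
	V   float64 // sample value
}

func main() {
	// Remote write reads the WAL and must remember labels for every ref it has
	// seen, because sample records carry only the ref, not the labels.
	labelCache := map[uint64]map[string]string{}

	series := seriesRecord{Ref: 1, Labels: map[string]string{"__name__": "up", "job": "node"}}
	labelCache[series.Ref] = series.Labels

	sample := sampleRecord{Ref: 1, T: 1597849200000, V: 1}
	fmt.Println("sending", labelCache[sample.Ref], "value", sample.V)

	// Every new (churned) series adds another entry to this cache, which is why
	// series churn drives up the remote-write memory footprint.
}
```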


yywandb commented on August 19, 2024

Thanks for your help @cstyan. We found the target with the large number of metrics. We added an external label for each agent's shard number so that we could query which targets were scraped by each agent, and narrowed it down from there.

Moving forward, we're setting up a dedicated agent for that target with higher memory requests.

Thanks again!


yuecong commented on August 19, 2024

Thanks @cstyan and @rfratto! Closing this issue.

