
Comments (13)

rfratto commented on August 19, 2024

10 GB definitely seems like a lot. I suspect this might be caused by Go >= 1.12's use of MADV_FREE for memory management, which leads to a higher reported RSS even though a portion of that memory can be reclaimed by the kernel whenever it needs it. You can revert to the 1.11 behavior by setting the GODEBUG environment variable to madvdontneed=1, which gives a more accurate RSS reading.

Checking the go_memstats_heap_inuse_bytes metric from the Agent will give a clearer picture of how much memory is actively being used. Dividing that by agent_wal_storage_active_series will also give you the average bytes per series; we tend to see around 9KB across our 20 agents.

If go_memstats_heap_inuse_bytes is still unexpectedly high, it would be useful to share a heap profile. You can generate one by running wget against /debug/pprof/heap on the Agent's HTTP server.
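If it's easier to script, here's a minimal Go sketch that does the same thing as wget (the localhost:12345 address is an assumption; substitute whatever HTTP listen address your Agent uses):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Hypothetical address: replace with the Agent's actual HTTP listen address/port.
	resp, err := http.Get("http://localhost:12345/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Save the profile; inspect it later with `go tool pprof heap.pprof`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```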


rfratto commented on August 19, 2024

In my testing, most of the memory usage tends to boil down to:

a) how many series exist in a single scraped target
b) the length of label names and values
c) the number of active series

If you have one target with a significant number of series, the memory allocated for scraping that target sticks around in a pool even when the target isn't being actively scraped.

Likewise, even though strings are interned, having mostly long label names and values will negatively influence memory usage, as will a large number of active series.
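To put a rough number on (b), here's a toy sketch (a simplified Label type, not Prometheus's labels.Labels, and it ignores interning and struct overhead) that estimates the raw label bytes a single series carries:

```go
package main

import "fmt"

// Label is a simplified stand-in for a Prometheus label pair.
type Label struct {
	Name, Value string
}

// approxLabelBytes sums the raw byte length of every label name and value.
// It ignores interning and struct overhead, so treat it as a lower bound.
func approxLabelBytes(lbls []Label) int {
	n := 0
	for _, l := range lbls {
		n += len(l.Name) + len(l.Value)
	}
	return n
}

func main() {
	// Hypothetical series: long pod names and bucket labels add up quickly.
	series := []Label{
		{"__name__", "http_request_duration_seconds_bucket"},
		{"namespace", "production"},
		{"pod", "checkout-service-6c9f8d7b9d-x2k4l"},
		{"le", "0.25"},
	}
	fmt.Printf("~%d bytes of label data for this series\n", approxLabelBytes(series))
}
```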


yywandb commented on August 19, 2024

Thanks for your help @rfratto!

We tried setting the GODEBUG env variable as you suggested, but the same issue persists (memory spikes and then the pod OOMs, since we have a memory limit set). Does this indicate that it's not a memory-management issue?

This is what the size per series looks like (I added the env variable at 16:54): [screenshot]

Looking at the heap profile from an agent with the issue and an agent without:

remote-write-agent3 (with the memory spiking problem): [heap profile screenshot]

remote-write-agent0 (no problems): [heap profile screenshot]

Here are the heap dumps:
heapdump.zip

Does anything look suspicious to you in those heapdumps?


yywandb commented on August 19, 2024

Update:
go_memstats_heap_inuse_bytes metrics: [screenshot]

container_memory_rss using cadvisor metrics: [screenshot]

After updating the agents with the env variable at 16:54, remote-write-agent-3 had a spike in memory usage and was OOMKilled at 17:07. Its memory usage seems stable now, but we are beginning to see the memory usage of remote-write-agent-2 creep up.

Do you think this indicates that there's a scrape target with large metrics (e.g. long labels) that has moved between the agents? We shard the scrape targets based on pod name, so it's possible that a pod name changed, causing the target, now under a different pod name, to be scraped by a different agent.
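For what it's worth, here's a rough sketch of why a pod rename tends to move a target to a different agent under hashmod-style sharding (simplified; the md5-modulus step mirrors what a hashmod relabel rule does, and the shard count of 4 plus the pod names are made-up examples):

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// shardFor hashes the pod name and reduces it modulo the number of agents,
// mimicking a hashmod-style relabel rule that shards scrape targets.
func shardFor(podName string, shards uint64) uint64 {
	sum := md5.Sum([]byte(podName))
	// Use the low 8 bytes of the digest as a uint64, as hashmod effectively does.
	return binary.BigEndian.Uint64(sum[8:]) % shards
}

func main() {
	const shards = 4 // example: 4 remote-write agents
	// A pod rename (e.g. a new ReplicaSet hash) usually lands on a different shard.
	for _, pod := range []string{"big-exporter-7d9f4c-abcde", "big-exporter-5b8c2d-fghij"} {
		fmt.Printf("%s -> remote-write-agent-%d\n", pod, shardFor(pod, shards))
	}
}
```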


rfratto commented on August 19, 2024

Thanks for providing the heap dumps! Unfortunately nothing seems out of the ordinary; most of the memory usage comes from building labels during relabeling (which I assume is because hashmod is applied to all the series).

From what I can tell, I agree that this seems like there's just a giant target that's moving between agents and pushing your pod beyond its memory limits. How many active series were on remote-write-agent-3 before and after the memory usage moved to -2?

I've reached out to my team to see if we can get a second pair of eyes on this just in case I'm missing something here.


yywandb commented on August 19, 2024

Yeah, the number of active series seems like the prime suspect here.

Number of active series (same pattern as memory usage): [screenshot]

Number of samples (seems to be around the same across agents): [screenshot]

It seems like there's probably a target with a high number of unique metric series that's moving between the agents.

Do you know the best way to analyze which targets have the highest number of active series? I saw this blog post about analyzing cardinality and churn for Prometheus, and I'm wondering if we can do something similar for the remote write agent.


rfratto commented on August 19, 2024

That information isn't exposed in the Agent yet, unfortunately. Prometheus can expose information about targets and series per target in its scraper, but I'm not using that yet (I had planned to expose it in an API that I vaguely described in #6).

Short term, the easiest way for you to find out the problematic target might be to hack on the Agent to print out that metadata. Off the top of my head, I'd suggest adding a goroutine in the instance code that polls the scrape manager and logs out targets with a high number of series. You could then build a new image by running RELEASE_BUILD=1 make agent-image.
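Something along these lines might work as a starting point (a sketch only, not code that exists in the Agent: scrape.Manager.TargetsActive() is a real Prometheus API, but seriesCount is a hypothetical helper you would have to plumb out of the scrape loop yourself):

```go
package instance

import (
	"log"
	"time"

	"github.com/prometheus/prometheus/scrape"
)

// seriesCount is hypothetical: scrape.Target doesn't expose a per-target
// series count, so this value would need to be plumbed out of the scrape loop.
func seriesCount(t *scrape.Target) int { return 0 /* placeholder */ }

// logLargeTargets polls the scrape manager once a minute and logs any target
// whose series count exceeds threshold. Sketch only; not wired into the Agent.
func logLargeTargets(mgr *scrape.Manager, threshold int) {
	for range time.Tick(time.Minute) {
		for job, targets := range mgr.TargetsActive() {
			for _, t := range targets {
				if n := seriesCount(t); n > threshold {
					log.Printf("target with high series count: job=%s target=%s series=%d",
						job, t.Labels().String(), n)
				}
			}
		}
	}
}
```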


yuecong commented on August 19, 2024

Thanks @rfratto, I'll give that a try. BTW, I'm curious why we still need to keep active queries if it's purely remote write. Is this a Prometheus limitation? :)


rfratto commented on August 19, 2024

Hi @yuecong, I'm not sure what you mean by needing to keep active queries; could you explain a little more?


yuecong commented on August 19, 2024

Say one agent scrapes M metrics across all of its targets; let's say that's about 15K. Some of those metrics have high cardinality, with label values that change from scrape to scrape, which makes the number of active series (what I meant by active queries) much higher than the number of metrics scraped in any single pass. I think this is likely what is happening in our system.
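As a toy illustration of that effect (hypothetical metric and label names, nothing from our actual workload):

```go
// A series is identified by its full label set, so a label whose value changes
// on every scrape creates a brand new series each time, even though only one
// metric is exposed.
package main

import "fmt"

func main() {
	seen := map[string]struct{}{} // distinct series observed so far

	for scrape := 0; scrape < 5; scrape++ {
		// e.g. a "request_id"-style label that is different on every scrape
		series := fmt.Sprintf(`http_requests_total{pod="api-0",request_id="req-%d"}`, scrape)
		seen[series] = struct{}{}
	}

	// 1 scraped metric, but 5 active series after 5 scrapes
	fmt.Printf("metrics scraped per pass: 1, distinct series: %d\n", len(seen))
}
```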

So I'm wondering whether the agent could avoid caring about the cardinality of each metric entirely and just call the remote write API to the storage backend, so that the agent doesn't suffer from the high cardinality. I agree the storage backend will still suffer from it, for sure. :)

Hope this is more clear.


cstyan commented on August 19, 2024

@yuecong Because of the way Prometheus' WAL and remote write system are designed (the agent is built on top of them), there's no way to avoid caring about active series.

The WAL has several record types, but the ones that matter for remote write are series records and sample records. Series records are written when a new series starts (or in checkpoints if a series is still active) and contain a reference ID plus the labels for that series. Sample records contain only the reference ID and the current timestamp/value for that series. For remote write to be more reliable by reusing the WAL, the remote write system has to cache the data it reads from series records, so series churn over short periods of time leads to increased memory usage with remote write.
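To make that concrete, here's a heavily simplified sketch of the two record shapes and the label cache remote write has to maintain (illustrative types only, not the real definitions in Prometheus's tsdb/record package):

```go
package main

import "fmt"

// Simplified stand-ins for the WAL record types (the real ones live in
// Prometheus's tsdb/record package and carry more detail).
type seriesRecord struct {
	Ref    uint64            // reference ID for the series
	Labels map[string]string // full label set, written once when the series starts
}

type sampleRecord struct {
	Ref uint64  // points back at a series record
	T   int64   // timestamp (ms)
	V   float64 // sample value
}

func main() {
	// Remote write reads the WAL and must remember labels for every ref it has
	// seen, because sample records carry only the ref, not the labels.
	labelCache := map[uint64]map[string]string{}

	series := seriesRecord{Ref: 1, Labels: map[string]string{"__name__": "up", "job": "node"}}
	labelCache[series.Ref] = series.Labels

	sample := sampleRecord{Ref: 1, T: 1597849200000, V: 1}
	fmt.Println("sending", labelCache[sample.Ref], "value", sample.V)

	// Every new (churned) series adds another entry to this cache, which is why
	// series churn drives up the remote-write memory footprint.
}
```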


yywandb commented on August 19, 2024

Thanks for your help @cstyan. We found the target with the large number of metrics. We added an external label for each agent's shard number so that we could query which targets were scraped by each agent, and narrowed it down from there.

Moving forward, we're setting up a dedicated agent for that target with higher memory requests.

Thanks again!


yuecong commented on August 19, 2024

Thanks @cstyan and @rfratto! Closing this issue.

