projecteru2 / agent
Eru agent is an agent running on each node. It provides monitoring and exposes metrics, and supports containers and virtual machines.
Home Page: https://eru.dev
License: MIT License
eru-agent[213677]: time="2021-04-07T09:50:33+08:00" level=info msg="[Watch] Monitor: cid f5b10cf action attach"
eru-agent[213677]: time="2021-04-07T09:50:33+08:00" level=info msg="[Watch] Monitor: cid f5b10cf action attach"
eru-agent[213677]: panic: duplicate metrics collector registration attempted
eru-agent[213677]: goroutine 91514882 [running]:
eru-agent[213677]: github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0xc0001345f0, 0xc014c3c500, 0x13, 0x13)
eru-agent[213677]: /home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:401 +0xad
eru-agent[213677]: github.com/prometheus/client_golang/prometheus.MustRegister(...)
eru-agent[213677]: /home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:178
eru-agent[213677]: github.com/projecteru2/agent/engine.NewMetricsClient(0x0, 0x0, 0x7ffe6dbd2f7b, 0x6, 0xc03ff3c370, 0x1)
eru-agent[213677]: /home/runner/work/agent/agent/engine/metrics.go:165 +0x210d
eru-agent[213677]: github.com/projecteru2/agent/engine.(*Engine).stat(0xc0001c86c0, 0x1401e00, 0xc054a310c0, 0xc03ff3c370)
eru-agent[213677]: /home/runner/work/agent/agent/engine/stat.go:41 +0x2e5
eru-agent[213677]: created by github.com/projecteru2/agent/engine.(*Engine).attach
eru-agent[213677]: /home/runner/work/agent/agent/engine/attach.go:65 +0x632
systemd[1]: [email protected]: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: [email protected]: Unit entered failed state.
systemd[1]: [email protected]: Failed with result 'exit-code'.
Assume a container runs in a private network and binds port 8001. When the container dies, its IP address becomes an empty string, which leaves the TCP health check target as something like :8001.
If another container on that node runs with host networking and also binds port 8001, the health check for the dead container will pass, and core will mark the dead container with status running=false and healthy=true.
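One way to avoid this false positive is to refuse to build a health check target when the container has no IP. The following is a minimal sketch, not the agent's actual code: tcpCheckAddr and errNoAddress are hypothetical names, and the sample IP is made up. It relies on the fact that Go's net package treats an empty host such as ":8001" as the local system.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errNoAddress reports that a container has no IP to probe (hypothetical name).
var errNoAddress = errors.New("container has no IP address")

// tcpCheckAddr builds the host:port target for a TCP health check.
// It refuses an empty IP, because ":8001" would be dialed on the local
// system and could reach an unrelated host-network container.
func tcpCheckAddr(ip, port string) (string, error) {
	if ip == "" {
		return "", errNoAddress
	}
	return net.JoinHostPort(ip, port), nil
}

func main() {
	for _, ip := range []string{"10.0.0.5", ""} {
		addr, err := tcpCheckAddr(ip, "8001")
		if err != nil {
			// Treat the container as unhealthy instead of probing ":8001".
			fmt.Println("skip check, report unhealthy:", err)
			continue
		}
		fmt.Println("probe", addr)
	}
}
```

With a guard like this, a dead container without an IP would be reported as unhealthy rather than inheriting the result of whatever listens on the same port on the host.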
time="2020-08-20T07:21:32Z" level=info msg="[EruResolver] start sync service discovery"
time="2020-08-20T07:21:32Z" level=info msg="[logServe] Log monitor started"
time="2020-08-20T07:21:32Z" level=fatal msg="rpc error: code = Unknown desc = bad Count value: key: /node/370e66f31bc8"
Currently mem_percent is computed as memory.usage_in_bytes / total_memory.
In some cases we also need mem_rss_percent, computed as mem_rss / total_memory, to report RSS usage.
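The two ratios can be sketched as follows. This is only an illustration of the arithmetic: ratio is a hypothetical helper, and the byte values are made-up samples rather than real cgroup readings.

```go
package main

import "fmt"

// ratio returns part/total, or 0 when total is 0 (for example when no limit is known).
func ratio(part, total uint64) float64 {
	if total == 0 {
		return 0
	}
	return float64(part) / float64(total)
}

func main() {
	// Hypothetical sample values, in bytes.
	var usage, rss, total uint64 = 3 << 30, 1 << 30, 4 << 30
	fmt.Println("mem_percent    ", ratio(usage, total)) // 0.75
	fmt.Println("mem_rss_percent", ratio(rss, total))   // 0.25
}
```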
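The smallest code to reproduce:
package main

import (
	"encoding/json"
	"os"
	"time"
)

// C holds a map that is shared between goroutines without any locking.
type C struct {
	m map[string]string
}

func main() {
	c := C{
		m: map[string]string{"a": "b"},
	}
	// Reader: marshals the map in a tight loop.
	go func() {
		for {
			json.Marshal(c.m)
		}
	}()
	// Writer: mutates the same map after one second, deliberately unsynchronized.
	go func() {
		time.Sleep(1 * time.Second)
		delete(c.m, "a")
	}()
	// Block forever; nothing is ever sent on sigs.
	sigs := make(chan os.Signal, 1)
	<-sigs
}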
Because this is a race condition, the panic does not occur on every run, but it can be captured within a few attempts.
/go/src/tmp/panic # ./agent_panic
panic: reflect: call of reflect.Value.Type on zero Value [recovered]
panic: reflect: call of reflect.Value.Type on zero Value
goroutine 6 [running]:
encoding/json.(*encodeState).marshal.func1(0xc000056f38)
/usr/local/go/src/encoding/json/encode.go:328 +0x8d
panic(0x4c02e0, 0xc000326858)
/usr/local/go/src/runtime/panic.go:965 +0x1b9
reflect.Value.Type(0x0, 0x0, 0x0, 0xc000321170, 0x0)
/usr/local/go/src/reflect/value.go:1908 +0x189
encoding/json.stringEncoder(0xc00009e080, 0x0, 0x0, 0x0, 0xc000320100)
/usr/local/go/src/encoding/json/encode.go:620 +0x4c
encoding/json.mapEncoder.encode(0x4e08e8, 0xc00009e080, 0x4c0d00, 0xc00005e150, 0x15, 0x4c0100)
/usr/local/go/src/encoding/json/encode.go:813 +0x366
encoding/json.(*encodeState).reflectValue(0xc00009e080, 0x4c0d00, 0xc00005e150, 0x15, 0xc000050100)
/usr/local/go/src/encoding/json/encode.go:360 +0x82
encoding/json.(*encodeState).marshal(0xc00009e080, 0x4c0d00, 0xc00005e150, 0x100, 0x0, 0x0)
/usr/local/go/src/encoding/json/encode.go:332 +0xf9
encoding/json.Marshal(0x4c0d00, 0xc00005e150, 0xc0003157a0, 0x9, 0x10, 0x0, 0x0)
/usr/local/go/src/encoding/json/encode.go:161 +0x52
main.main.func1(0xc00005e150)
/go/src/tmp/panic/main.go:21 +0x37
created by main.main
/go/src/tmp/panic/main.go:19 +0x9f
And this is exactly the same panic log @jason-joo reported to me.
When forwarding logs to journald, each log line carries a flexible data container: Extra.
But currently it is unusable (see the EXTRA field in the entry below).
We need to define what it should contain and how it is encoded.
{
"Fields": {
"DATE_TIME": "2021-04-29 18:02:39.296602",
"ENTRY_POINT": "test",
"EXTRA": "map[]",
"ID": "11bedb8f3d35791c76fa68cf1fedf09558a8a95912326d3db55139b90e6bef27",
"IDENT": "PewtMl",
"MESSAGE": "message 64 bytes from 127.0.0.1: seq=49 ttl=64 time=0.071 ms",
"PRIORITY": "3",
"SYSLOG_IDENTIFIER": "test",
"TYPE": "stdout",
"_BOOT_ID": "b403cea00ac94ef8a2fc4f0c933a9431",
"_CAP_EFFECTIVE": "3fffffffff",
"_CMDLINE": "/usr/local/bin/eru-agent --config /etc/eru/agent.yaml",
"_COMM": "eru-agent",
"_EXE": "/usr/local/bin/eru-agent",
"_GID": "0",
"_HOSTNAME": "test01",
"_MACHINE_ID": "80be556a18e047711dc7d96215a20528",
"_PID": "148385",
"_SOURCE_REALTIME_TIMESTAMP": "1619690559296662",
"_SYSTEMD_CGROUP": "/system.slice/eru-agent.service",
"_SYSTEMD_SLICE": "system.slice",
"_SYSTEMD_UNIT": "eru-agent.service",
"_TRANSPORT": "journal",
"_UID": "0"
},
"Cursor": "s=5faedb7721d34f189ec9899c5b88cbdc;i=c9763;b=b403cea00ac94ef8a2fc4f0c933a9431;m=1a9e416ab3;t=5c119986ed59d;x=140dfabcc0f557ce",
"RealtimeTimestamp": 1619690559296925,
"MonotonicTimestamp": 114324236979
}
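The EXTRA value "map[]" in the entry above matches what Go's default %v formatting prints for an empty map, which is hard for consumers to parse. One possible encoding, offered here only as an assumption and not as a decision made in the issue, is to serialize Extra as a JSON object. encodeExtra is a hypothetical helper and the "node" key is a made-up example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeExtra renders extra as a JSON object so that journald consumers can parse it.
// A nil map is encoded as an empty object rather than "null".
func encodeExtra(extra map[string]string) (string, error) {
	if extra == nil {
		extra = map[string]string{}
	}
	b, err := json.Marshal(extra)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Default formatting prints map[], the same text seen in EXTRA above.
	fmt.Printf("%v\n", map[string]string{})

	s, err := encodeExtra(map[string]string{"node": "test01"})
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // {"node":"test01"}
}
```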
When I upgraded a container from 4GB to 6GB,
eru agent still used 4GB for the mem_max_usage metric, which causes mem_percent to show wrong data:
mem_max_usage{appname=""} 4.66870272e+09
mem_percent{appname=""} 1.0853919982910156
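docker attach opens a StreamConfig object that manages forwarding of the container's IO. If the attach client blocks, the StreamConfig copyIO goroutine blocks, which blocks Wait(), which in turn blocks the container's lifecycle handling. As a result, the process inside the container has already been killed but the container status is still running, and docker stop hangs after sending SIGKILL, waiting for the container status to become NotRunning.
Attempting to reproduce. At first I stopped the filebeat container and the agent container, then put the agent source on the machine and single-stepped it with dlv, which reproduced the problem. There is a simpler way, though: stop filebeat and agent, run pip install docker, and in a Python REPL attach to a container that logs rapidly, without handling the attach stream. After a while (<2min), docker logs --tail 1 -f shows that the attached container no longer produces logs and its process is blocked. At that point docker stop / kill both hang, so the problem is reproduced.
Analysis of the docker source: the overall picture is described above, but the details are as follows:
In every sense, this thing is damn bizarre.
Also, let's enable the log flag when attaching.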
The agent configuration has changed, including the new container filtering feature, but the sample YAML does not reflect it, neither in the template configuration nor in the documentation.
When a container stops, eru needs to update the running field regardless of the health check status.
Assume the health check interval is T.
Assume the initial health check fails right after the started event is received from docker.
In the current implementation, the next chance to report the container's health comes only after T has elapsed.
In some large IDCs the interval T is quite long and can exceed 100 seconds.
When the health check starts from an initially unhealthy state, it may make more sense to retry with a backoff interval that grows up to T.
Though it was marked as merely possible, it really happened.
A race on a map occurred and caused a panic and termination. Logs are attached below.
health_check.go:57
checkOneContainer() can be invoked asynchronously:
And the labels map is shared between different invocations.
Generally speaking there are two ways to solve this:
In this case I personally prefer the second solution, because checking the same container concurrently doesn't make sense. So introducing some kind of locking in checkOneContainer() to skip duplicated checks may be a simple way (just a suggestion).
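The suggestion to skip duplicated checks can be sketched with a sync.Map that records which container IDs currently have a check in flight. This is an illustration under assumptions, not the agent's implementation: inflight, tryStart, done, and checkOnce are hypothetical names, and the check callback stands in for the body of checkOneContainer().

```go
package main

import (
	"fmt"
	"sync"
)

// inflight records container IDs whose check is currently running.
type inflight struct {
	ids sync.Map
}

// tryStart returns true if the caller may check id, false if a check is already running.
func (f *inflight) tryStart(id string) bool {
	_, loaded := f.ids.LoadOrStore(id, struct{}{})
	return !loaded
}

// done marks the check for id as finished.
func (f *inflight) done(id string) {
	f.ids.Delete(id)
}

// checkOnce runs check for id unless one is already in flight.
// It reports whether check was actually executed.
func (f *inflight) checkOnce(id string, check func()) bool {
	if !f.tryStart(id) {
		return false
	}
	defer f.done(id)
	check()
	return true
}

func main() {
	var f inflight
	ran := f.checkOnce("e1851e2", func() {
		// A nested attempt for the same container is skipped.
		fmt.Println("nested ran:", f.checkOnce("e1851e2", func() {}))
	})
	fmt.Println("outer ran:", ran)
}
```

Because only one check per container runs at a time, the labels map of that container is never marshaled by one invocation while another one writes to it.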
Feb 9 11:03:58 dev01 eru-agent[13787]: time="2021-02-09T11:03:58+08:00" level=info msg="[Watch] Monitor: cid e1851e2 action start"
Feb 9 11:03:58 dev01 eru-agent[13787]: time="2021-02-09T11:03:58+08:00" level=info msg="[attach] attach redisproxy container e1851e2 success"
Feb 9 11:03:58 dev01 eru-agent[13787]: fatal error: concurrent map iteration and map write
Feb 9 11:03:58 dev01 eru-agent[13787]: goroutine 26264 [running]:
Feb 9 11:03:58 dev01 eru-agent[13787]: runtime.throw(0x12942cf, 0x26)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/panic.go:1116 +0x72 fp=0xc000141aa0 sp=0xc000141a70 pc=0x437a92
Feb 9 11:03:58 dev01 eru-agent[13787]: runtime.mapiternext(0xc000474060)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/map.go:853 +0x554 fp=0xc000141b20 sp=0xc000141aa0 pc=0x410d34
Feb 9 11:03:58 dev01 eru-agent[13787]: reflect.mapiternext(0xc000474060)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/map.go:1337 +0x2b fp=0xc000141b38 sp=0xc000141b20 pc=0x469c8b
Feb 9 11:03:58 dev01 eru-agent[13787]: reflect.Value.MapKeys(0x10ca180, 0xc0005642a0, 0x15, 0x0, 0xc000141c18, 0x40429a)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/reflect/value.go:1227 +0x10c fp=0xc000141bc8 sp=0xc000141b38 pc=0x4a02ac
Feb 9 11:03:58 dev01 eru-agent[13787]: encoding/json.mapEncoder.encode(0x12c2d58, 0xc00028ed80, 0x10ca180, 0xc0005642a0, 0x15, 0x10c0100)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:785 +0xff fp=0xc000141d40 sp=0xc000141bc8 pc=0x55351f
Feb 9 11:03:58 dev01 eru-agent[13787]: encoding/json.mapEncoder.encode-fm(0xc00028ed80, 0x10ca180, 0xc0005642a0, 0x15, 0x1b00100)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:777 +0x65 fp=0xc000141d80 sp=0xc000141d40 pc=0x55fbe5
Feb 9 11:03:58 dev01 eru-agent[13787]: encoding/json.(*encodeState).reflectValue(0xc00028ed80, 0x10ca180, 0xc0005642a0, 0x15, 0xc000140100)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:358 +0x82 fp=0xc000141db8 sp=0xc000141d80 pc=0x5508c2
Feb 9 11:03:58 dev01 eru-agent[13787]: encoding/json.(*encodeState).marshal(0xc00028ed80, 0x10ca180, 0xc0005642a0, 0x880100, 0x0, 0x0)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:330 +0xf4 fp=0xc000141e18 sp=0xc000141db8 pc=0x5504b4
Feb 9 11:03:58 dev01 eru-agent[13787]: encoding/json.Marshal(0x10ca180, 0xc0005642a0, 0x40, 0x106d340, 0xc00068a0b0, 0x45d964b800, 0x4b)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:161 +0x52 fp=0xc000141e90 sp=0xc000141e18 pc=0x54faf2
Feb 9 11:03:58 dev01 eru-agent[13787]: github.com/projecteru2/agent/store/core.(*CoreStore).SetContainerStatus(0xc000509a40, 0x13f25c0, 0xc00003c658, 0xc000256370, 0xc000416000, 0x0, 0x0)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/Users/jason.zhuyj/projects/eru2/agent/store/core/container.go:29 +0x6d fp=0xc000141f78 sp=0xc000141e90 pc=0xb917ed
Feb 9 11:03:58 dev01 eru-agent[13787]: github.com/projecteru2/agent/engine.(*Engine).checkOneContainer(0xc0003fb1a0, 0xc000256370)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/Users/jason.zhuyj/projects/eru2/agent/engine/health_check.go:68 +0x77 fp=0xc000141fd0 sp=0xc000141f78 pc=0xbbd517
Feb 9 11:03:58 dev01 eru-agent[13787]: runtime.goexit()
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000141fd8 sp=0xc000141fd0 pc=0x470121
Feb 9 11:03:58 dev01 eru-agent[13787]: created by github.com/projecteru2/agent/engine.(*Engine).handleContainerStart
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/Users/jason.zhuyj/projects/eru2/agent/engine/monitor.go:52 +0x2cf
Feb 9 11:03:58 dev01 eru-agent[13787]: goroutine 1 [select, 1115 minutes]:
Feb 9 11:03:58 dev01 eru-agent[13787]: runtime.gopark(0x12c65c8, 0x0, 0x1809, 0x1)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/proc.go:306 +0xe5 fp=0xc00018fa68 sp=0xc00018fa48 pc=0x43a685
Feb 9 11:03:58 dev01 eru-agent[13787]: runtime.selectgo(0xc00018fc28, 0xc00018fbd0, 0x2, 0x1, 0x1)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/select.go:338 +0xcef fp=0xc00018fb90 sp=0xc00018fa68 pc=0x44a7ef
Feb 9 11:03:58 dev01 eru-agent[13787]: github.com/projecteru2/agent/engine.(*Engine).Run(0xc0003fb1a0, 0x13f2580, 0xc00020cec0, 0x0, 0x0)
Feb 9 11:03:58 dev01 eru-agent[13787]: #011/Users/jason.zhuyj/projects/eru2/agent/engine/engine.go:107 +0x239 fp=0xc00018fc88 sp=0xc00018fb90 pc=0xbbcaf9
Feb 9 11:03:58 dev01 eru-agent[13787]: main.serve(0xc00020c3c0, 0x0, 0x0)