
agent's People

Contributors

anrs, cmgs, duodenuml, jason-joo, jschwinger233, timfeirg, tonicbupt, wrfly, yuyang0


agent's Issues

agent health checking on dead container

Assume a container runs in a private network and binds port 8001. When the container dies, its IP address becomes an empty string, leaving the health-check TCP address as something like :8001.

If another container on the same node runs with host networking and also binds port 8001, the health check for the dead container will pass, and core will mark the dead container with status running=false and healthy=true.
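The degenerate address can be sketched as follows. This is a minimal illustration, not the agent's actual code; `healthCheckAddr` and `safeHealthCheckAddr` are hypothetical helper names:

```go
package main

import (
	"fmt"
	"net"
)

// healthCheckAddr mimics how a TCP health-check target is built: for a
// dead container the IP is an empty string, so the address degenerates
// to ":8001", which dials the local host and may hit an unrelated
// host-network container listening on the same port.
func healthCheckAddr(ip string, port int) string {
	return net.JoinHostPort(ip, fmt.Sprintf("%d", port))
}

// safeHealthCheckAddr refuses to produce a dialable address when the
// container no longer has an IP.
func safeHealthCheckAddr(ip string, port int) (string, bool) {
	if ip == "" {
		return "", false
	}
	return net.JoinHostPort(ip, fmt.Sprintf("%d", port)), true
}

func main() {
	fmt.Println(healthCheckAddr("", 8001)) // prints ":8001"
	if _, ok := safeHealthCheckAddr("", 8001); !ok {
		fmt.Println("skipped health check for container without IP")
	}
}
```

Skipping (or immediately failing) the check when the IP is empty avoids the false positive described above.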

duplicated metrics collector

eru-agent[213677]: time="2021-04-07T09:50:33+08:00" level=info msg="[Watch] Monitor: cid f5b10cf action attach"
eru-agent[213677]: time="2021-04-07T09:50:33+08:00" level=info msg="[Watch] Monitor: cid f5b10cf action attach"
eru-agent[213677]: panic: duplicate metrics collector registration attempted
eru-agent[213677]: goroutine 91514882 [running]:
eru-agent[213677]: github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0xc0001345f0, 0xc014c3c500, 0x13, 0x13)
eru-agent[213677]:         /home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:401 +0xad
eru-agent[213677]: github.com/prometheus/client_golang/prometheus.MustRegister(...)
eru-agent[213677]:         /home/runner/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:178
eru-agent[213677]: github.com/projecteru2/agent/engine.NewMetricsClient(0x0, 0x0, 0x7ffe6dbd2f7b, 0x6, 0xc03ff3c370, 0x1)
eru-agent[213677]:         /home/runner/work/agent/agent/engine/metrics.go:165 +0x210d
eru-agent[213677]: github.com/projecteru2/agent/engine.(*Engine).stat(0xc0001c86c0, 0x1401e00, 0xc054a310c0, 0xc03ff3c370)
eru-agent[213677]:         /home/runner/work/agent/agent/engine/stat.go:41 +0x2e5
eru-agent[213677]: created by github.com/projecteru2/agent/engine.(*Engine).attach
eru-agent[213677]:         /home/runner/work/agent/agent/engine/attach.go:65 +0x632
systemd[1]: [email protected]: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: [email protected]: Unit entered failed state.
systemd[1]: [email protected]: Failed with result 'exit-code'.

enhancement: get `logline.Extra` maintained

Background

There is a flexible data container, Extra, in each log line when forwarding logs to journald.
But currently it's unusable:

  1. Nothing is ever written into it.
  2. It's not encoded as JSON (it appears as Go's default map[] rendering).

Enhancement

Define what should be written into Extra and how it should be encoded.

Reference

{
    "Fields": {
        "DATE_TIME": "2021-04-29 18:02:39.296602",
        "ENTRY_POINT": "test",
        "EXTRA": "map[]",
        "ID": "11bedb8f3d35791c76fa68cf1fedf09558a8a95912326d3db55139b90e6bef27",
        "IDENT": "PewtMl",
        "MESSAGE": "message 64 bytes from 127.0.0.1: seq=49 ttl=64 time=0.071 ms",
        "PRIORITY": "3",
        "SYSLOG_IDENTIFIER": "test",
        "TYPE": "stdout",
        "_BOOT_ID": "b403cea00ac94ef8a2fc4f0c933a9431",
        "_CAP_EFFECTIVE": "3fffffffff",
        "_CMDLINE": "/usr/local/bin/eru-agent --config /etc/eru/agent.yaml",
        "_COMM": "eru-agent",
        "_EXE": "/usr/local/bin/eru-agent",
        "_GID": "0",
        "_HOSTNAME": "test01",
        "_MACHINE_ID": "80be556a18e047711dc7d96215a20528",
        "_PID": "148385",
        "_SOURCE_REALTIME_TIMESTAMP": "1619690559296662",
        "_SYSTEMD_CGROUP": "/system.slice/eru-agent.service",
        "_SYSTEMD_SLICE": "system.slice",
        "_SYSTEMD_UNIT": "eru-agent.service",
        "_TRANSPORT": "journal",
        "_UID": "0"
    },
    "Cursor": "s=5faedb7721d34f189ec9899c5b88cbdc;i=c9763;b=b403cea00ac94ef8a2fc4f0c933a9431;m=1a9e416ab3;t=5c119986ed59d;x=140dfabcc0f557ce",
    "RealtimeTimestamp": 1619690559296925,
    "MonotonicTimestamp": 114324236979
}
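The "EXTRA": "map[]" field above is Go's fmt rendering of an empty map rather than JSON. A minimal sketch of encoding the field properly before forwarding, assuming a hypothetical logline type (field names are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Log is a stand-in for the agent's logline type.
type Log struct {
	ID    string
	Extra map[string]string
}

// encodeExtra renders Extra as a JSON object string instead of Go's
// default "map[...]" formatting, so downstream consumers can parse it.
func encodeExtra(l *Log) (string, error) {
	if l.Extra == nil {
		l.Extra = map[string]string{}
	}
	b, err := json.Marshal(l.Extra)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	l := &Log{ID: "11bedb8f"}
	fmt.Printf("%v\n", l.Extra) // prints "map[]" — the unusable current form
	s, _ := encodeExtra(l)
	fmt.Println(s) // prints "{}" — parseable JSON
}
```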

conn race condition

TLDR

docker attach opens a StreamConfig object to manage forwarding of the container's IO. When the attach client blocks, StreamConfig's copyIO goroutine blocks, which makes Wait() block, which in turn blocks the container's lifecycle handling. So the process inside the container has already been killed, yet the container's state stays running, and docker stop blocks after sending SIGKILL, waiting for the container state to become NotRunning.

long story

  • Reproducing it. At first I stopped the filebeat container, stopped the agent container too, then pushed the agent source up and single-stepped it with dlv; that reproduced it. But there is a simpler way: stop filebeat and agent, pip install docker, and in a Python repl attach to a container that logs rapidly, but never consume the attach stream. After a while (<2min), docker logs --tail 1 -f shows that the attached container no longer emits logs: its process is blocked. At this point both docker stop and kill hang. Successfully reproduced.

  • Analyzing the docker source. The big picture is described above; the details are:

  1. copyIO keeps a 32k in-memory buffer, so a blocked attach does not immediately block the container process;
  2. Killing the container's PID 1 does not interrupt the blocked copyIO either. Although the OS closes all fds when the process exits, copyIO's exit path is to read EOF from the Reader and write it to the Writer; since the write to the Writer is blocked, copyIO does not exit when the Reader closes;
  3. Without pprof on docker's blocking point, this analysis may be off. For instance, I'm not entirely sure it blocks before StreamConfig.Wait(), because the earlier DeleteTask could also block on Wait IO, where the object is wrapped more deeply (an rio object handed to containerd). That shouldn't change the conclusion: the blocked attach blocks copyIO, which keeps the container's lifecycle from transitioning normally;
  4. Because the agent's attach config sets Log: false, attach does not read through docker's logdriver. Testing in the Python repl confirms that with the Log flag enabled, a blocked attach no longer blocks the container's log output, though stop / kill still hang....
  • Back to eru agent: after stopping filebeat I found a race condition. Here, the nil check on w.conn is done without a mutex, so a later Close() can trigger a nil.Close() runtime panic. This can be confirmed by grepping the agent logs after stopping filebeat: intermittent panics, and inspecting the agent container shows plenty of restarts. It is these panic-driven restarts that make the agent's attach handling die off intermittently; merely losing filebeat (connection refused on forwarding) would not by itself degrade performance this badly...

Conclusion

In every sense, this thing is unbelievably cursed.
Also: enable the Log flag when attaching.

agent can not get updated container config

When I upgraded a container from 4GB to 6GB,
eru agent still uses the old 4GB value for the mem_max_usage metric, so mem_percent shows wrong data:
mem_max_usage{appname=""} 4.66870272e+09
mem_percent{appname=""} 1.0853919982910156
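The overshoot past 1.0 follows directly from the arithmetic: usage divided by a limit that was cached at attach time and never refreshed. A hypothetical sketch (the function name and the exact formula are assumptions for illustration):

```go
package main

import "fmt"

// memPercent sketches the metric computation: usage over limit. If the
// agent keeps using the limit cached when the container was first
// attached (4 GiB) after the container was resized to 6 GiB, the ratio
// exceeds 1.0 even though the container is within its real limit.
func memPercent(usageBytes, limitBytes float64) float64 {
	return usageBytes / limitBytes
}

func main() {
	const gib = 1 << 30
	usage := 4.66870272e9 // from the reported metrics

	fmt.Println(memPercent(usage, 4*gib)) // stale 4 GiB limit: ratio > 1.0, wrong
	fmt.Println(memPercent(usage, 6*gib)) // refreshed 6 GiB limit: plausible ratio
}
```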

Health check with exponential back off on startup

Use case

Assume the health check interval is T.
Assume the initial health check fails right after receiving the started event from docker.

In the current implementation, the container's second chance to report its health comes only after T.
In some large IDCs the interval T is significantly long; it can be more than 100 seconds.

Suggestion

When the health check boots, it may make more sense to probe with an exponentially growing backoff interval, capped at T, while the container is initially unhealthy.

Possible race condition when health checking on same container

Though it's marked as "possible", it really happened.

Issue Description

A race on a map occurred, causing a panic and process termination. Logs are attached below.

Issue Analysis

health_check.go:57 checkOneContainer() can be invoked concurrently from:

  • checkAllContainers()
  • handleContainerStart()

And the labels map is shared between the different invocations.
Generally speaking there are two ways to solve this:

  1. Always return a distinct copy of the labels map.
  2. Avoid the race condition.

In this case I personally prefer the second, because checking the same container concurrently doesn't make sense. So introducing some kind of locking in checkOneContainer() to skip duplicate checks would be a simple fix (just a suggestion).
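The "skip duplicate checks" idea can be sketched with a per-container in-flight set. Names mirror the issue but the implementation is hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// inFlight records container IDs whose health check is currently
// running, so checkAllContainers and handleContainerStart never run
// checkOneContainer for the same container at the same time.
var inFlight sync.Map

// checkOneContainer runs check for cid unless a check for that cid is
// already in flight, in which case it skips and returns false.
func checkOneContainer(cid string, check func()) bool {
	if _, loaded := inFlight.LoadOrStore(cid, struct{}{}); loaded {
		return false // another goroutine is already checking this container
	}
	defer inFlight.Delete(cid)
	check()
	return true
}

func main() {
	ran := 0
	checkOneContainer("e1851e2", func() {
		ran++
		// a concurrent call for the same cid is skipped while we run
		if checkOneContainer("e1851e2", func() { ran++ }) {
			panic("duplicate check was not skipped")
		}
	})
	fmt.Println(ran) // prints 1
}
```

This serializes checks per container without copying the labels map, at the cost of occasionally dropping a redundant check, which is harmless here.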

Related Logs

Feb  9 11:03:58 dev01 eru-agent[13787]: time="2021-02-09T11:03:58+08:00" level=info msg="[Watch] Monitor: cid e1851e2 action start"
Feb  9 11:03:58 dev01 eru-agent[13787]: time="2021-02-09T11:03:58+08:00" level=info msg="[attach] attach redisproxy container e1851e2 success"
Feb  9 11:03:58 dev01 eru-agent[13787]: fatal error: concurrent map iteration and map write
Feb  9 11:03:58 dev01 eru-agent[13787]: goroutine 26264 [running]:
Feb  9 11:03:58 dev01 eru-agent[13787]: runtime.throw(0x12942cf, 0x26)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/panic.go:1116 +0x72 fp=0xc000141aa0 sp=0xc000141a70 pc=0x437a92
Feb  9 11:03:58 dev01 eru-agent[13787]: runtime.mapiternext(0xc000474060)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/map.go:853 +0x554 fp=0xc000141b20 sp=0xc000141aa0 pc=0x410d34
Feb  9 11:03:58 dev01 eru-agent[13787]: reflect.mapiternext(0xc000474060)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/map.go:1337 +0x2b fp=0xc000141b38 sp=0xc000141b20 pc=0x469c8b
Feb  9 11:03:58 dev01 eru-agent[13787]: reflect.Value.MapKeys(0x10ca180, 0xc0005642a0, 0x15, 0x0, 0xc000141c18, 0x40429a)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/reflect/value.go:1227 +0x10c fp=0xc000141bc8 sp=0xc000141b38 pc=0x4a02ac
Feb  9 11:03:58 dev01 eru-agent[13787]: encoding/json.mapEncoder.encode(0x12c2d58, 0xc00028ed80, 0x10ca180, 0xc0005642a0, 0x15, 0x10c0100)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:785 +0xff fp=0xc000141d40 sp=0xc000141bc8 pc=0x55351f
Feb  9 11:03:58 dev01 eru-agent[13787]: encoding/json.mapEncoder.encode-fm(0xc00028ed80, 0x10ca180, 0xc0005642a0, 0x15, 0x1b00100)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:777 +0x65 fp=0xc000141d80 sp=0xc000141d40 pc=0x55fbe5
Feb  9 11:03:58 dev01 eru-agent[13787]: encoding/json.(*encodeState).reflectValue(0xc00028ed80, 0x10ca180, 0xc0005642a0, 0x15, 0xc000140100)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:358 +0x82 fp=0xc000141db8 sp=0xc000141d80 pc=0x5508c2
Feb  9 11:03:58 dev01 eru-agent[13787]: encoding/json.(*encodeState).marshal(0xc00028ed80, 0x10ca180, 0xc0005642a0, 0x880100, 0x0, 0x0)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:330 +0xf4 fp=0xc000141e18 sp=0xc000141db8 pc=0x5504b4
Feb  9 11:03:58 dev01 eru-agent[13787]: encoding/json.Marshal(0x10ca180, 0xc0005642a0, 0x40, 0x106d340, 0xc00068a0b0, 0x45d964b800, 0x4b)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/encoding/json/encode.go:161 +0x52 fp=0xc000141e90 sp=0xc000141e18 pc=0x54faf2
Feb  9 11:03:58 dev01 eru-agent[13787]: github.com/projecteru2/agent/store/core.(*CoreStore).SetContainerStatus(0xc000509a40, 0x13f25c0, 0xc00003c658, 0xc000256370, 0xc000416000, 0x0, 0x0)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/Users/jason.zhuyj/projects/eru2/agent/store/core/container.go:29 +0x6d fp=0xc000141f78 sp=0xc000141e90 pc=0xb917ed
Feb  9 11:03:58 dev01 eru-agent[13787]: github.com/projecteru2/agent/engine.(*Engine).checkOneContainer(0xc0003fb1a0, 0xc000256370)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/Users/jason.zhuyj/projects/eru2/agent/engine/health_check.go:68 +0x77 fp=0xc000141fd0 sp=0xc000141f78 pc=0xbbd517
Feb  9 11:03:58 dev01 eru-agent[13787]: runtime.goexit()
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000141fd8 sp=0xc000141fd0 pc=0x470121
Feb  9 11:03:58 dev01 eru-agent[13787]: created by github.com/projecteru2/agent/engine.(*Engine).handleContainerStart
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/Users/jason.zhuyj/projects/eru2/agent/engine/monitor.go:52 +0x2cf
Feb  9 11:03:58 dev01 eru-agent[13787]: goroutine 1 [select, 1115 minutes]:
Feb  9 11:03:58 dev01 eru-agent[13787]: runtime.gopark(0x12c65c8, 0x0, 0x1809, 0x1)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/proc.go:306 +0xe5 fp=0xc00018fa68 sp=0xc00018fa48 pc=0x43a685
Feb  9 11:03:58 dev01 eru-agent[13787]: runtime.selectgo(0xc00018fc28, 0xc00018fbd0, 0x2, 0x1, 0x1)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/usr/local/golang/src/runtime/select.go:338 +0xcef fp=0xc00018fb90 sp=0xc00018fa68 pc=0x44a7ef
Feb  9 11:03:58 dev01 eru-agent[13787]: github.com/projecteru2/agent/engine.(*Engine).Run(0xc0003fb1a0, 0x13f2580, 0xc00020cec0, 0x0, 0x0)
Feb  9 11:03:58 dev01 eru-agent[13787]: #011/Users/jason.zhuyj/projects/eru2/agent/engine/engine.go:107 +0x239 fp=0xc00018fc88 sp=0xc00018fb90 pc=0xbbcaf9
Feb  9 11:03:58 dev01 eru-agent[13787]: main.serve(0xc00020c3c0, 0x0, 0x0)

Agent breaks down due to concurrency rw on Container.Labels (map[string]string)

The smallest code to reproduce it:

package main

import (
	"encoding/json"
	"os"
	"time"
)

type C struct {
	m map[string]string
}

func main() {

	c := C{
		m: map[string]string{"a": "b"},
	}

	// goroutine 1: serializes the map in a loop, as SetContainerStatus does
	go func() {
		for {
			json.Marshal(c.m)
		}
	}()

	// goroutine 2: writes to the same map concurrently
	go func() {
		time.Sleep(1 * time.Second)
		delete(c.m, "a")
	}()

	sigs := make(chan os.Signal, 1)
	<-sigs
}

Since it's a race condition, it won't occur on every run, but the panic can be captured within a few tries.

/go/src/tmp/panic # ./agent_panic
panic: reflect: call of reflect.Value.Type on zero Value [recovered]
	panic: reflect: call of reflect.Value.Type on zero Value

goroutine 6 [running]:
encoding/json.(*encodeState).marshal.func1(0xc000056f38)
	/usr/local/go/src/encoding/json/encode.go:328 +0x8d
panic(0x4c02e0, 0xc000326858)
	/usr/local/go/src/runtime/panic.go:965 +0x1b9
reflect.Value.Type(0x0, 0x0, 0x0, 0xc000321170, 0x0)
	/usr/local/go/src/reflect/value.go:1908 +0x189
encoding/json.stringEncoder(0xc00009e080, 0x0, 0x0, 0x0, 0xc000320100)
	/usr/local/go/src/encoding/json/encode.go:620 +0x4c
encoding/json.mapEncoder.encode(0x4e08e8, 0xc00009e080, 0x4c0d00, 0xc00005e150, 0x15, 0x4c0100)
	/usr/local/go/src/encoding/json/encode.go:813 +0x366
encoding/json.(*encodeState).reflectValue(0xc00009e080, 0x4c0d00, 0xc00005e150, 0x15, 0xc000050100)
	/usr/local/go/src/encoding/json/encode.go:360 +0x82
encoding/json.(*encodeState).marshal(0xc00009e080, 0x4c0d00, 0xc00005e150, 0x100, 0x0, 0x0)
	/usr/local/go/src/encoding/json/encode.go:332 +0xf9
encoding/json.Marshal(0x4c0d00, 0xc00005e150, 0xc0003157a0, 0x9, 0x10, 0x0, 0x0)
	/usr/local/go/src/encoding/json/encode.go:161 +0x52
main.main.func1(0xc00005e150)
	/go/src/tmp/panic/main.go:21 +0x37
created by main.main
	/go/src/tmp/panic/main.go:19 +0x9f

And this is exactly the same panic log that @jason-joo reported to me.

When deploying the agent

time="2020-08-20T07:21:32Z" level=info msg="[EruResolver] start sync service discovery"
time="2020-08-20T07:21:32Z" level=info msg="[logServe] Log monitor started"
time="2020-08-20T07:21:32Z" level=fatal msg="rpc error: code = Unknown desc = bad Count value: key: /node/370e66f31bc8"
