Giter Club home page Giter Club logo

Comments (10)

kaechele avatar kaechele commented on May 28, 2024 2

Unfortunately, Dr. Jung (:tada:), I have been unable to capture a log in this situation. But I would not be surprised given the current state of the client that there is some case in which the state machine goes berserk.

from tunneldigger.

RalfJung avatar RalfJung commented on May 28, 2024 1

@valcryst have you or do you plan to upstream this change into gluon itself?

In saarland we have patched our own secondary tunneldigger watchdog to count the number of tunneldigger processes, and issue a reboot if there are too many of them. That also seems to help.

from tunneldigger.

RalfJung avatar RalfJung commented on May 28, 2024

Thanks. :) (However you found out about that.^^)

Yeah, I have fairly low confidence in that state machine code... but without a log I also don't really know where to start debugging this.

from tunneldigger.

valcryst avatar valcryst commented on May 28, 2024

We are facing this issue for some months now, but we wherent able to get access to one of the affected routers till now.
We observed that this symptom always starts after the "daily" reboot and affects around 70-80 Routers out of 1000.

image

When this happens you see Tunneldigger Broker instantly running the "down-hook" after creating the tunnel.
This is what we have seen on the Tunneldigger Client on the Router.

Sat Jun 11 13:09:25 2022 daemon.debug td-client: [broker2.ff-en.de:10122] Broker usage: 4587
Sat Jun 11 13:09:25 2022 daemon.info td-client: Selected broker1.ff-en.de:10122 as the best broker.
Sat Jun 11 13:09:25 2022 daemon.info td-client: [broker1.ff-en.de:10122] Tunnel successfully established.
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is enabled
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Network device 'mesh-vpn' link is up
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' has link connectivity
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is setting up now
Sat Jun 11 13:09:25 2022 daemon.notice netifd: Interface 'mesh_vpn' is now up
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.162513] batman_adv: bat0: Adding interface: mesh-vpn
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.168110] batman_adv: bat0: The MTU of interface mesh-vpn is too small (1420) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.192795] batman_adv: bat0: Interface activated: mesh-vpn
Sat Jun 11 13:09:26 2022 daemon.err td-client: [broker1.ff-en.de:10122] Unable to create local L2TP tunnel!
Sat Jun 11 13:09:26 2022 daemon.err td-client: [broker1.ff-en.de:10122] Connection lost.
Sat Jun 11 13:09:26 2022 kern.info kernel: [293673.348378] batman_adv: bat0: Interface deactivated: mesh-vpn
Sat Jun 11 13:09:26 2022 daemon.notice netifd: Network device 'mesh-vpn' link is down

But more interesting is this "ps w" output.

root@FF-WTT-Citymanagement-Wetter:~# ps wwwww
  PID USER       VSZ STAT COMMAND
    1 root      1568 S    /sbin/procd
    2 root         0 SW   [kthreadd]
    4 root         0 IW<  [kworker/0:0H]
    6 root         0 IW<  [mm_percpu_wq]
    7 root         0 SW   [ksoftirqd/0]
   43 root         0 SW   [oom_reaper]
   90 root         0 IW<  [writeback]
   92 root         0 SW   [kcompactd0]
   93 root         0 IW<  [crypto]
   95 root         0 IW<  [kblockd]
  116 root         0 IW<  [watchdogd]
  129 root         0 IW   [kworker/0:1]
  138 root         0 SW   [kswapd0]
  197 root         0 SW   [spi0]
  292 root         0 IW<  [ipv6_addrconf]
  300 root         0 IW<  [kworker/0:1H]
  379 root         0 SWN  [jffs2_gcd_mtd4]
  446 root      1216 S    /sbin/ubusd
  447 root       924 S    /sbin/askfirst /usr/libexec/login.sh
  464 root      1028 S    /sbin/urngd
  635 root         0 IW<  [cfg80211]
  659 root         0 IW<  [bat_events]
  769 root      1296 S    /sbin/logd -S 64
  821 dnsmasq   1348 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
  836 root         0 IW   [kworker/0:2]
  855 root      1080 S    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 1022 root       932 S    /usr/sbin/gluon-arp-limiter
 1069 root      1764 S    /sbin/netifd
 1148 root       932 S    /usr/sbin/uradvd -i local-node -p 2a11:6c6:6000:feed::/64
 1388 root      1212 S    udhcpc -p /var/run/udhcpc-br-wan.pid -s /lib/netifd/dhcp.script -f -t 0 -i br-wan -x hostname:FF-WTT-Citymanagement-Wetter -C -O 121
 1391 root      1052 S    odhcp6c -s /lib/netifd/dhcpv6.script -t120 br-wan
 1396 root      1052 S    odhcp6c -s /lib/netifd/dhcpv6.script -L -m30 -t120 br-client
 1531 root      1124 S    /usr/sbin/gluon-radv-filterd -i br-client -c RADV_FILTER -t 20
 1587 root       932 S    /usr/sbin/micrond /usr/lib/micron.d
 1664 root       928 S    /usr/sbin/sse-multiplexd
 1760 root      1364 S    /usr/sbin/uhttpd -f -h /lib/gluon/status-page/www -r FF-WTT-Citymanagement-Wetter -x /cgi-bin -t 60 -T 30 -k 20 -A 1 -n 32 -N 100 -R -p 0.0.0.0:80 -p [::]:80
 1930 root      1352 S    /usr/sbin/dnsmasq -x /var/run/gluon-wan-dnsmasq.pid -u root -i lo -p 54 -h -r /var/gluon/wan-dnsmasq/resolv.conf
 1941 root      1068 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 1950 nobody    1040 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 1951 nobody    1040 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 2049 root      1584 S    /usr/sbin/hostapd -s -P /var/run/wifi-phy0.pid -B /var/run/hostapd-phy0.conf
 2073 root      1216 S<   /usr/sbin/ntpd -n -N -S /usr/sbin/ntpd-hotplug -p 2a11:6c6:6000:feed:ff::1 -p 2a11:6c6:6000:feed:ff::2 -p 91.212.181.1
 4095 root         0 IW   [kworker/u2:0]
 4204 root      2324 S    /usr/bin/respondd -d /usr/lib/respondd -p 1001 -g ff02::2:1001 -i mesh0 -g ff05::2:1001 -i br-client -t 10
 4519 root      1148 S    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 4563 root      1220 S    -ash
 4979 root      1068 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 4980 nobody    1040 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 4981 nobody    1040 S    /usr/bin/tunneldigger -u d46e0ed65cbc -i mesh-vpn -t 1 -b broker1.ff-en.de:10122 -b broker2.ff-en.de:10122 -I br-wan -a
 5038 root      1212 R    ps wwwww
28583 root         0 IW   [kworker/u2:3]
32214 root         0 IW   [kworker/u2:2]
root@FF-WTT-Citymanagement-Wetter:~#

It looks like it gets started twice and this will last till next reboot (eventualy).
I suspect that some changes how the Tunneldigger watchdog kills the process introduced this?

Kindly Regards
Matthias

from tunneldigger.

RalfJung avatar RalfJung commented on May 28, 2024

It looks like it gets started twice

Oh, yeah that would not be good. So this is an interesting lead indeed. In fact there are 6 entries for tunneldigger on your list? On one of my devices I see 3, so I guess that is the normal number. (Is this one line per thread? And then the main process is forking and we still see the parent process? Or something like that?)

Which watchdog changes are you talking about? We have our own separate watchdog at https://git.hacksaar.de/FreifunkSaar/gluon-ffsaar/-/tree/master/gluon-ffsaar-watchdog since the "built-in" one has proven to be insufficient, but that one does a full device reboot, so that can't be it.
(In fact as a work-around I could make that watchdog initiate a reboot when it sees more than 3 instances of /usr/bin/tunneldigger...)

This would have to be a bug in `/etc/init.d/tunneldigger, I think? That just uses /sbin/start-stop-daemon though so it'd be a bug in there.

from tunneldigger.

valcryst avatar valcryst commented on May 28, 2024

Oh, yeah that would not be good. So this is an interesting lead indeed. In fact there are 6 entries for tunneldigger on your list? On one of my devices I see 3, so I guess that is the normal number. (Is this one line per thread? And then the main process is forking and we still see the parent process? Or something like that?)

Indeed its the main process and one thread for each broker in the broker list.

Which watchdog changes are you talking about? We have our own separate watchdog at https://git.hacksaar.de/FreifunkSaar/gluon-ffsaar/-/tree/master/gluon-ffsaar-watchdog since the "built-in" one has proven to be insufficient, but that one does a full device reboot, so that can't be it. (In fact as a work-around I could make that watchdog initiate a reboot when it sees more than 3 instances of /usr/bin/tunneldigger...)

Well i stopped building firmware long ago in our community and someone else is doing it now.
I refer to the old watchdog that was inside tunneldigger before it was accepted into gluon repo.

Old Watchdog
New Watchdog

It included an "killall" wich i assume covered cases like this, the newer one touches only matching pids as it seems.

This would have to be a bug in `/etc/init.d/tunneldigger, I think? That just uses /sbin/start-stop-daemon though so it'd be a bug in there.

You mean this?
https://github.com/freifunk-gluon/packages/blob/master/net/tunneldigger/files/tunneldigger.init#L79

I dont know how it comes that routers end up with two instances of tunneldigger,
but it could be a starting point to look why it happens randomly after daily reboots.

from tunneldigger.

RalfJung avatar RalfJung commented on May 28, 2024

Indeed its the main process and one thread for each broker in the broker list.

That's not what I see. We have 4 brokers and I see 3 processes total.

# ps | grep tunneldigger
 1972 root      1068 S    /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
 1973 nobody    1040 S    /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a
 1974 nobody    1040 S    /usr/bin/tunneldigger -u 30b5c26e9d6b -i mesh-vpn -t 1 -b gw1.saar.freifunk.net:10001 -b gw2.saar.freifunk.net:10001 -b gw3.saar.freifunk.net:10001 -b gw4.saar.freifunk.net:10001 -I br-wan -a

from tunneldigger.

valcryst avatar valcryst commented on May 28, 2024

Ok then this assumption was wrong.
Today i have more than 100 routers affected by this problem and the only way to keep
the gateways running while they get hammered is to block them by rate-limiter.

This is how it looks on one of the gateways.

image


I think i will modify that tunneldigger watchdog in gluon and add a killall to see if that fixes this mess.

from tunneldigger.

valcryst avatar valcryst commented on May 28, 2024

Hey @RalfJung i just wanted to let you know, we have overcome this issue by patching
the tunneldigger-watchdog that ships with Gluon. We couldnt investigate deeper and needed
a quick solution, because lots of our refugee installations where hit by this problem.
It seems that "killall -KILL tunneldigger" was our saviour here.

Kindly Regards

#!/usr/bin/lua

local uci = require('simple-uci').cursor()

local function restart_tunneldigger()
	os.execute('logger -t tunneldigger-watchdog "Restarting Tunneldigger."')
	os.execute('/etc/init.d/tunneldigger stop')
	os.execute('sleep 1')
	os.execute('killall -KILL tunneldigger')
	os.execute('rm -f /var/run/tunneldigger.mesh-vpn.pid'
	os.execute('sleep 5')
	os.execute('/etc/init.d/tunneldigger start')
end

local function read_pid_file()
	local pid_file = io.open('/var/run/tunneldigger.mesh-vpn.pid', 'r')
	if not pid_file then
		return nil
	end
	local pid = pid_file:read('*l')
	pid_file:close()
	return pid
end

local function has_mesh_vpn_neighbours()
	local handle = io.popen('batctl o', 'r')
	if not handle then
		return false
	end
	for line in handle:lines() do
		if line:find('mesh%-vpn') then
			handle:close()
			return true
		end
	end
	handle:close()
	return false
end

if uci:get_bool('tunneldigger', 'mesh_vpn', 'enabled') then
	if io.popen('pgrep -x /usr/bin/tunneldigger'):read('*l') ~= read_pid_file() then
		os.execute('logger -t tunneldigger-watchdog "Process-Pid does not match with pid-File."')
		restart_tunneldigger()
		return
	end
	if not has_mesh_vpn_neighbours() then
		os.execute('logger -t tunneldigger-watchdog "No vpn-mesh neighbours found."')
		restart_tunneldigger()
		return
	end
end

from tunneldigger.

valcryst avatar valcryst commented on May 28, 2024

@RalfJung I've sent my feedback, but i'm no coder and maybe the experts on this find out
the real cause that triggers this unwanted behaviour of multiple running tunneldiggers.

from tunneldigger.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.