So if Sygnal cannot send iOS pushes, but Android is still operational, then Android pu

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

ok I'm going to mark this as done in favour of <a class="issue-link js-issue-link" dat

A persistent failure of one route should not impact the ability to send through another about sygnal HOT 7 CLOSED

neilisfragile commented on September 1, 2024

A persistent failure of one route should not impact the ability to send through another

from sygnal.

Comments (7)

Half-Shot commented on September 1, 2024

I'm surprised that a failure to send one notification would lead to another being impacted, at least on a small scale. Is this related to us having too many open connections to the IOS API and therefore impacting Android's ability to work?

from sygnal.

richvdh commented on September 1, 2024

@Half-Shot: well, empirically all push died despite only one certificate expiring.

We've raised a task to investigate in more detail about what happened.

from sygnal.

clokep commented on September 1, 2024

Looking at this briefly, it seems that if a notification dispatch for a particular device fails then subsequent notifications will be skipped

The exceptions for dispatch_notification are caught outside the loop instead of per device, see:

sygnal/sygnal/http.py

Lines 215 to 266 in 8567224

 try: 

 rejected = [] 

 for d in notif.devices: 

 NOTIFS_RECEIVED_DEVICE_PUSH_COUNTER.inc() 

 appid = d.app_id 

 if appid not in self.sygnal.pushkins: 

 log.warning("Got notification for unknown app ID %s", appid) 

 rejected.append(d.pushkey) 

 continue 

 pushkin = self.sygnal.pushkins[appid] 

 log.debug( 

 "Sending push to pushkin %s for app ID %s", pushkin.name, appid 

 ) 

 NOTIFS_BY_PUSHKIN.labels(pushkin.name).inc() 

 result = await pushkin.dispatch_notification(notif, d, context) 

 if not isinstance(result, list): 

 raise TypeError("Pushkin should return list.") 

 rejected += result 

 request.write(json.dumps({"rejected": rejected}).encode()) 

 if rejected: 

 log.info( 

 "Successfully delivered notifications with %d rejected pushkeys", 

 len(rejected), 

 ) 

 except NotificationDispatchException: 

 request.setResponseCode(502) 

 log.warning("Failed to dispatch notification.", exc_info=True) 

 except Exception: 

 request.setResponseCode(500) 

 log.error("Exception whilst dispatching notification.", exc_info=True) 

 finally: 

 if not request._disconnected: 

 request.finish() 

 PUSHGATEWAY_HTTP_RESPONSES_COUNTER.labels(code=request.code).inc() 

 root_span.set_tag(tags.HTTP_STATUS_CODE, request.code) 

 req_time = time.perf_counter() - context.start_time 

 if req_time > 0: 

 # can be negative as perf_counter() may not be monotonic 

 NOTIFY_HANDLE_HISTOGRAM.labels(code=request.code).observe(req_time) 

 if not 200 <= request.code < 300: 

 root_span.set_tag(tags.ERROR, True) 

 root_span.finish()

It seems this is done so that the response can be properly set to an error if the notification fails, this seems buggy though since a notification could partially go through and still get an error response. (E.g. if the initial device is successful and the second device is a failure...) Unfortunately this doesn't really seem to explain what is seen here though since it seems Synapse only sends a single device at a time:

https://github.com/matrix-org/synapse/blob/0d775fcc2d0c7b6a07dad5430256d4d6c75a9f0d/synapse/push/httppusher.py#L309-L370

From this it isn't clear to me how there could be an interaction between different pushkins.

from sygnal.

richvdh commented on September 1, 2024

It's very frustrating that we haven't really looked at this until 11 days later, by which time all the logs have rotated. I think we need to reconsider how this sort of investigation gets prioritised.

Having said that, we have some information from our various prometheus metrics.

Note that we have two separate sygnal instances. These are routed to by haproxy, which has slightly different configurations for the two servers: sygnal-twisted-1 has a connection limit of 200, while sygnal-twisted-0 is unlimited.

First sygnal-twisted-1. At 14:10 UTC, we see connections stacking up on haproxy, where they time out after 5 seconds:

This is consistent with sygnal suddenly taking a long time to process all or some requests, and is enough to explain why pushes unrelated to the failed certificate started failing.

Some related graphs from the sygnal end showing the same thing:

... so basically all the APNS requests start stacking up infinitely. Meanwhile the queue time for GCM pushes drops to zero, presumably because haproxy starts dropping a percentage of them:

ok, so now sygnal-twisted-0. This seemed to be fine until the sygnal process was restarted at 18:11 UTC, when suddenly lots of requests start timing out. Presumably the cert is only checked when we reconnect to the APNS server, or something.

The 5xxs are initially due to haproxy timing out the request (after 3 minutes), or giving up trying to connect to the server (after 10 retries) when we start getting connection errors later:

The request timeouts appear to be the same as before: that Sygnal starts tarpitting the requests. I think the connection errors are just due to Sygnal taking a long time to accept, possibly because it was busy garbage-collecting:

It looks like the percentage of non-APNS requests actually failing on this server was much smaller, since they weren't being blocked by the haproxy.

So in summary the behaviour can be explained by

when the APNS requests failed, sygnal started taking a very long time to process those requests
that meant that the slots on the haproxy got full of APNS requests
so other requests started failing too.

So the main question is why sygnal wasn't responding to these failing requests, and instead was allowing them to stack up forever, leading to 5GB of memory use.

Some other operational questions:

should we increase the maxconn on sygnal-twisted-1? I seem to think we added it because we had problems with large numbers of requests stacking up when things got busy
shouldn't sygnal-twisted-0 have a maxconn?

from sygnal.

richvdh commented on September 1, 2024

So the main question is why sygnal wasn't responding to these failing requests, and instead was allowing them to stack up forever, leading to 5GB of memory use.

Part of the problem here might be the retry schedule:

4 attempts at 1 second intervals in aioapns (we set pool.max_connection_attempts=3, but I think there is an off-by-one error. The connection pool stuff doesn't really do anything until a connection is established.)
3 attempts at 10 and 20 second intervals in apnspushkin.

So that's a total of 3s+10s+3s+20s+3s == 39s for each push, so that doesn't really explain why it kept going up, rather than plateauing.

I'm at a bit of a loss here. I think we'd have to set up a test sygnal and try to reproduce the problem somehow, which seems a bit out of scope for now.

from sygnal.

richvdh commented on September 1, 2024

So that's a total of 3s+10s+3s+20s+3s == 39s for each push, so that doesn't really explain why it kept going up, rather than plateauing.

... although it is of course enough time to explain why the server with a maxconn soon got saturated with apns requests.

from sygnal.

richvdh commented on September 1, 2024

ok I'm going to mark this as done in favour of #116 and #117

from sygnal.

A persistent failure of one route should not impact the ability to send through another about sygnal HOT 7 CLOSED

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	try:
	rejected = []

	for d in notif.devices:
	NOTIFS_RECEIVED_DEVICE_PUSH_COUNTER.inc()

	appid = d.app_id
	if appid not in self.sygnal.pushkins:
	log.warning("Got notification for unknown app ID %s", appid)
	rejected.append(d.pushkey)
	continue

	pushkin = self.sygnal.pushkins[appid]
	log.debug(
	"Sending push to pushkin %s for app ID %s", pushkin.name, appid
	)

	NOTIFS_BY_PUSHKIN.labels(pushkin.name).inc()

	result = await pushkin.dispatch_notification(notif, d, context)
	if not isinstance(result, list):
	raise TypeError("Pushkin should return list.")

	rejected += result

	request.write(json.dumps({"rejected": rejected}).encode())

	if rejected:
	log.info(
	"Successfully delivered notifications with %d rejected pushkeys",
	len(rejected),
	)
	except NotificationDispatchException:
	request.setResponseCode(502)
	log.warning("Failed to dispatch notification.", exc_info=True)
	except Exception:
	request.setResponseCode(500)
	log.error("Exception whilst dispatching notification.", exc_info=True)
	finally:
	if not request._disconnected:
	request.finish()

	PUSHGATEWAY_HTTP_RESPONSES_COUNTER.labels(code=request.code).inc()
	root_span.set_tag(tags.HTTP_STATUS_CODE, request.code)

	req_time = time.perf_counter() - context.start_time
	if req_time > 0:
	# can be negative as perf_counter() may not be monotonic
	NOTIFY_HANDLE_HISTOGRAM.labels(code=request.code).observe(req_time)
	if not 200 <= request.code < 300:
	root_span.set_tag(tags.ERROR, True)
	root_span.finish()