Giter Club home page Giter Club logo

Comments (7)

Half-Shot avatar Half-Shot commented on September 1, 2024

I'm surprised that a failure to send one notification would lead to another being impacted, at least on a small scale. Is this related to us having too many open connections to the IOS API and therefore impacting Android's ability to work?

from sygnal.

richvdh avatar richvdh commented on September 1, 2024

@Half-Shot: well, empirically all push died despite only one certificate expiring.

We've raised a task to investigate in more detail about what happened.

from sygnal.

clokep avatar clokep commented on September 1, 2024

Looking at this briefly, it seems that if a notification dispatch for a particular device fails then subsequent notifications will be skipped

The exceptions for dispatch_notification are caught outside the loop instead of per device, see:

sygnal/sygnal/http.py

Lines 215 to 266 in 8567224

try:
rejected = []
for d in notif.devices:
NOTIFS_RECEIVED_DEVICE_PUSH_COUNTER.inc()
appid = d.app_id
if appid not in self.sygnal.pushkins:
log.warning("Got notification for unknown app ID %s", appid)
rejected.append(d.pushkey)
continue
pushkin = self.sygnal.pushkins[appid]
log.debug(
"Sending push to pushkin %s for app ID %s", pushkin.name, appid
)
NOTIFS_BY_PUSHKIN.labels(pushkin.name).inc()
result = await pushkin.dispatch_notification(notif, d, context)
if not isinstance(result, list):
raise TypeError("Pushkin should return list.")
rejected += result
request.write(json.dumps({"rejected": rejected}).encode())
if rejected:
log.info(
"Successfully delivered notifications with %d rejected pushkeys",
len(rejected),
)
except NotificationDispatchException:
request.setResponseCode(502)
log.warning("Failed to dispatch notification.", exc_info=True)
except Exception:
request.setResponseCode(500)
log.error("Exception whilst dispatching notification.", exc_info=True)
finally:
if not request._disconnected:
request.finish()
PUSHGATEWAY_HTTP_RESPONSES_COUNTER.labels(code=request.code).inc()
root_span.set_tag(tags.HTTP_STATUS_CODE, request.code)
req_time = time.perf_counter() - context.start_time
if req_time > 0:
# can be negative as perf_counter() may not be monotonic
NOTIFY_HANDLE_HISTOGRAM.labels(code=request.code).observe(req_time)
if not 200 <= request.code < 300:
root_span.set_tag(tags.ERROR, True)
root_span.finish()

It seems this is done so that the response can be properly set to an error if the notification fails, this seems buggy though since a notification could partially go through and still get an error response. (E.g. if the initial device is successful and the second device is a failure...) Unfortunately this doesn't really seem to explain what is seen here though since it seems Synapse only sends a single device at a time:

https://github.com/matrix-org/synapse/blob/0d775fcc2d0c7b6a07dad5430256d4d6c75a9f0d/synapse/push/httppusher.py#L309-L370

From this it isn't clear to me how there could be an interaction between different pushkins.

from sygnal.

richvdh avatar richvdh commented on September 1, 2024

It's very frustrating that we haven't really looked at this until 11 days later, by which time all the logs have rotated. I think we need to reconsider how this sort of investigation gets prioritised.

Having said that, we have some information from our various prometheus metrics.

Note that we have two separate sygnal instances. These are routed to by haproxy, which has slightly different configurations for the two servers: sygnal-twisted-1 has a connection limit of 200, while sygnal-twisted-0 is unlimited.

First sygnal-twisted-1. At 14:10 UTC, we see connections stacking up on haproxy, where they time out after 5 seconds:

image

This is consistent with sygnal suddenly taking a long time to process all or some requests, and is enough to explain why pushes unrelated to the failed certificate started failing.

Some related graphs from the sygnal end showing the same thing:

image
image
image

... so basically all the APNS requests start stacking up infinitely. Meanwhile the queue time for GCM pushes drops to zero, presumably because haproxy starts dropping a percentage of them:

image


ok, so now sygnal-twisted-0. This seemed to be fine until the sygnal process was restarted at 18:11 UTC, when suddenly lots of requests start timing out. Presumably the cert is only checked when we reconnect to the APNS server, or something.

image

The 5xxs are initially due to haproxy timing out the request (after 3 minutes), or giving up trying to connect to the server (after 10 retries) when we start getting connection errors later:

image

The request timeouts appear to be the same as before: that Sygnal starts tarpitting the requests. I think the connection errors are just due to Sygnal taking a long time to accept, possibly because it was busy garbage-collecting:

image

It looks like the percentage of non-APNS requests actually failing on this server was much smaller, since they weren't being blocked by the haproxy.


So in summary the behaviour can be explained by

  • when the APNS requests failed, sygnal started taking a very long time to process those requests
  • that meant that the slots on the haproxy got full of APNS requests
  • so other requests started failing too.

So the main question is why sygnal wasn't responding to these failing requests, and instead was allowing them to stack up forever, leading to 5GB of memory use.

Some other operational questions:

  • should we increase the maxconn on sygnal-twisted-1? I seem to think we added it because we had problems with large numbers of requests stacking up when things got busy
  • shouldn't sygnal-twisted-0 have a maxconn?

from sygnal.

richvdh avatar richvdh commented on September 1, 2024

So the main question is why sygnal wasn't responding to these failing requests, and instead was allowing them to stack up forever, leading to 5GB of memory use.

Part of the problem here might be the retry schedule:

  • 4 attempts at 1 second intervals in aioapns (we set pool.max_connection_attempts=3, but I think there is an off-by-one error. The connection pool stuff doesn't really do anything until a connection is established.)
  • 3 attempts at 10 and 20 second intervals in apnspushkin.

So that's a total of 3s+10s+3s+20s+3s == 39s for each push, so that doesn't really explain why it kept going up, rather than plateauing.

I'm at a bit of a loss here. I think we'd have to set up a test sygnal and try to reproduce the problem somehow, which seems a bit out of scope for now.

from sygnal.

richvdh avatar richvdh commented on September 1, 2024

So that's a total of 3s+10s+3s+20s+3s == 39s for each push, so that doesn't really explain why it kept going up, rather than plateauing.

... although it is of course enough time to explain why the server with a maxconn soon got saturated with apns requests.

from sygnal.

richvdh avatar richvdh commented on September 1, 2024

ok I'm going to mark this as done in favour of #116 and #117

from sygnal.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.