Comments (7)
I'm surprised that a failure to send one notification would lead to another being impacted, at least on a small scale. Is this related to us having too many open connections to the IOS API and therefore impacting Android's ability to work?
from sygnal.
@Half-Shot: well, empirically all push died despite only one certificate expiring.
We've raised a task to investigate in more detail about what happened.
from sygnal.
Looking at this briefly, it seems that if a notification dispatch for a particular device fails then subsequent notifications will be skipped
The exceptions for dispatch_notification
are caught outside the loop instead of per device, see:
Lines 215 to 266 in 8567224
It seems this is done so that the response can be properly set to an error if the notification fails, this seems buggy though since a notification could partially go through and still get an error response. (E.g. if the initial device is successful and the second device is a failure...) Unfortunately this doesn't really seem to explain what is seen here though since it seems Synapse only sends a single device at a time:
From this it isn't clear to me how there could be an interaction between different pushkins.
from sygnal.
It's very frustrating that we haven't really looked at this until 11 days later, by which time all the logs have rotated. I think we need to reconsider how this sort of investigation gets prioritised.
Having said that, we have some information from our various prometheus metrics.
Note that we have two separate sygnal instances. These are routed to by haproxy, which has slightly different configurations for the two servers: sygnal-twisted-1
has a connection limit of 200, while sygnal-twisted-0
is unlimited.
First sygnal-twisted-1
. At 14:10 UTC, we see connections stacking up on haproxy, where they time out after 5 seconds:
This is consistent with sygnal suddenly taking a long time to process all or some requests, and is enough to explain why pushes unrelated to the failed certificate started failing.
Some related graphs from the sygnal end showing the same thing:
... so basically all the APNS requests start stacking up infinitely. Meanwhile the queue time for GCM pushes drops to zero, presumably because haproxy starts dropping a percentage of them:
ok, so now sygnal-twisted-0
. This seemed to be fine until the sygnal process was restarted at 18:11 UTC, when suddenly lots of requests start timing out. Presumably the cert is only checked when we reconnect to the APNS server, or something.
The 5xxs are initially due to haproxy timing out the request (after 3 minutes), or giving up trying to connect to the server (after 10 retries) when we start getting connection errors later:
The request timeouts appear to be the same as before: that Sygnal starts tarpitting the requests. I think the connection errors are just due to Sygnal taking a long time to accept, possibly because it was busy garbage-collecting:
It looks like the percentage of non-APNS requests actually failing on this server was much smaller, since they weren't being blocked by the haproxy.
So in summary the behaviour can be explained by
- when the APNS requests failed, sygnal started taking a very long time to process those requests
- that meant that the slots on the haproxy got full of APNS requests
- so other requests started failing too.
So the main question is why sygnal wasn't responding to these failing requests, and instead was allowing them to stack up forever, leading to 5GB of memory use.
Some other operational questions:
- should we increase the maxconn on sygnal-twisted-1? I seem to think we added it because we had problems with large numbers of requests stacking up when things got busy
- shouldn't sygnal-twisted-0 have a maxconn?
from sygnal.
So the main question is why sygnal wasn't responding to these failing requests, and instead was allowing them to stack up forever, leading to 5GB of memory use.
Part of the problem here might be the retry schedule:
- 4 attempts at 1 second intervals in aioapns (we set
pool.max_connection_attempts=3
, but I think there is an off-by-one error. The connection pool stuff doesn't really do anything until a connection is established.) - 3 attempts at 10 and 20 second intervals in apnspushkin.
So that's a total of 3s+10s+3s+20s+3s == 39s
for each push, so that doesn't really explain why it kept going up, rather than plateauing.
I'm at a bit of a loss here. I think we'd have to set up a test sygnal and try to reproduce the problem somehow, which seems a bit out of scope for now.
from sygnal.
So that's a total of 3s+10s+3s+20s+3s == 39s for each push, so that doesn't really explain why it kept going up, rather than plateauing.
... although it is of course enough time to explain why the server with a maxconn soon got saturated with apns requests.
from sygnal.
ok I'm going to mark this as done in favour of #116 and #117
from sygnal.
Related Issues (20)
- FCM push for iOS issue HOT 2
- Web push notifications are not working with web.push.apple.com (Safari) HOT 3
- missing setup.py file HOT 3
- UnifiedPush support HOT 11
- Upgrade dependencies HOT 1
- Documentation for using Sygnal behind a reverse proxy HOT 9
- New tagged version HOT 2
- Disable logging for health endpoint
- anyway to use gcm/fcm key as of today 24/10/2023 for android notification HOT 2
- `sygnal.__version__` is broken
- Tests failing with Twisted 23.10
- twisted Timing out client: IPv6Address HOT 1
- ios push how to cancel user_id HOT 1
- GCMPushkin assumes that content field is not null. Fails when push format is `event_id_only`
- Unable to send Push notification with content_available and mutable_content configurations in sygnal.yaml HOT 13
- Sygnal only sending data-only FCM messages HOT 4
- v0.14.0 FCM error sending notification HOT 9
- FCM v1 Upgrade - 'NoneType' object has no attribute 'items' HOT 2
- docker build failed from tar.gz source HOT 1
- FCM API Credential refresh mechanism doesn't support HTTP proxies and blocks the event loop HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from sygnal.