trio-gitter-bot's Introduction

Gitter Bot for Python-Trio project

Watches for new Stack Overflow questions tagged python-trio and posts them to Python Trio's Gitter channel.

How it works

Runs as a scheduled task every 10 minutes.
Reads the RSS feed at https://stackoverflow.com/feeds/tag?tagnames=python-trio&sort=newest .
If there is a post newer than 10 minutes, posts it to Gitter.
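
The filtering step above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the `select_recent` name and the `"published"` key on each entry are assumptions made for the example, and the feed is represented as plain dicts rather than parsed RSS.

```python
from datetime import datetime, timedelta, timezone


def select_recent(entries, now, window=timedelta(minutes=10)):
    """Return the entries published within `window` before `now`.

    `entries` is assumed to be a list of dicts with a timezone-aware
    "published" datetime (a hypothetical shape for this sketch).
    """
    cutoff = now - window
    return [e for e in entries if cutoff < e["published"] <= now]
```

Passing `now` in explicitly, instead of calling `datetime.now()` inside the function, keeps the window easy to test.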

About the bot

This bot was heavily inspired by gidgethub and CPython's GitHub bots.

trio-gitter-bot's Issues

💡 Use trio in this project

It would be great to use Trio to build the bot for Trio.

logic for which posts to send seems inverted somehow

So someone finally posted a python-trio question on SO this evening:

https://stackoverflow.com/questions/57742649/asks-async-https-client-ssleoferror-on-connect-attempt-to-server-with-self-s

and the bot noticed and sprang into action! It posted 29 links to the chat, none of which are the new question:

https://gitter.im/python-trio/general?at=5d6b21c929dba2421ce0bc11

The gitter timestamps are within 5 min of the question being posted.

This is kind of hilarious, but also very confusing. 🤣😕⁉️

It's almost like the logic is:

IF there's a question that was posted in the last 10 min, THEN post all the questions that weren't posted in the last 10 min

Maybe RSSReader.read(newer_than=...) does not work the way one would expect?
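
One way to narrow this down would be a tiny check that feeds a "newer than" filter one fresh entry and one old entry and looks at what comes back. The sketch below is only an illustration: `check_filter` and the two stand-in filters are hypothetical, and nothing here calls or describes the real `RSSReader.read`. A flipped comparison is just one guess at the cause, and it would not by itself explain why the bot only fired once a new question appeared.

```python
from datetime import datetime, timedelta, timezone


def check_filter(filter_fn):
    """Return True if `filter_fn(entries, cutoff)` keeps only the fresh entry.

    `filter_fn` is a stand-in for whatever "newer than" filter is under
    test; the entry shape (dict with "title" and "published") is assumed.
    """
    now = datetime(2019, 9, 1, 1, 40, tzinfo=timezone.utc)  # arbitrary fixed time
    cutoff = now - timedelta(minutes=10)
    entries = [
        {"title": "fresh", "published": now - timedelta(minutes=4)},
        {"title": "old", "published": now - timedelta(days=30)},
    ]
    kept = [e["title"] for e in filter_fn(entries, cutoff)]
    return kept == ["fresh"]


def keep_newer(entries, cutoff):
    # Expected behaviour: keep entries published after the cutoff.
    return [e for e in entries if e["published"] > cutoff]


def keep_older(entries, cutoff):
    # Hypothetical bug: the comparison points the wrong way.
    return [e for e in entries if e["published"] < cutoff]
```

Here `check_filter(keep_newer)` passes and `check_filter(keep_older)` fails, which is the kind of signal that would confirm or rule out an inverted comparison.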

more reliable algorithms for choosing which questions to post

I don't know if this matters at all. I suspect probably not. But it was an interesting puzzle that I got nerd-sniped by, so I figured I'd write down what I thought of :-)

The bot's goal is to post questions without any duplicates or missing any, i.e. exactly-once delivery. As we know from distributed systems theory, reliable exactly-once delivery is ludicrously difficult or impossible, so we want to pick some "good enough" heuristic that isn't too complicated to implement. So the challenge is to optimize that reliability/simplicity tradeoff.

Right now the bot uses an extremely simple heuristic: it polls for questions every ten minutes, and then it posts any questions that have timestamps newer than ten minutes. I love this because if you think of it from a distributed systems perspective it's obviously too simple to work (schedulers aren't accurate! We're comparing clocks across two completely separate platforms!), but in fact it (probably) works very well and it's so simple that it's actually hard to beat.

It does at least theoretically have flaws though: successive ten minute windows are never going to line up exactly; there will always be a gap or some overlap. I guess probably a few seconds every ten minutes. If a question happens to be posted during that time, then it will be either double-posted or lost entirely. Can this be avoided, without massively complicating the implementation?
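
To make the gap/overlap concrete, here is a small simulation, with times as plain seconds and a made-up `posted_by` helper. It assumes each run posts every question whose timestamp falls in the window `(run - 600, run]`; the run times are invented to show a scheduler firing a few seconds late or early.

```python
WINDOW = 600  # ten minutes, in seconds


def posted_by(run_times, question_time, window=WINDOW):
    """Return the run times whose window (run - window, run] contains the question."""
    return [t for t in run_times if t - window < question_time <= t]


# Second run fires 3 s late: a question at t=601 falls into the gap.
lost = posted_by([600, 1203], 601)

# Second run fires 3 s early: a question at t=598 is covered twice.
doubled = posted_by([600, 1197], 598)

# Perfectly aligned runs: the question is picked up exactly once.
once = posted_by([600, 1200], 601)
```

With these numbers `lost` is empty, `doubled` contains both runs, and `once` contains only the second run.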

I did have one clever idea: when we start up, fetch the feed and store a list in-memory of which questions are already there. Then on future iterations, fetch the list, and post all the questions that aren't in our in-memory list, and add them to the list. That totally removes the dependency on accurate clocks, and is still super simple. It does have the downside that if a question is posted a few minutes before the bot is restarted, it might get lost entirely. I guess we could try to minimize that chance by using a hybrid of the two approaches: schedule our regular checks at a known absolute time (like: not just every ten minutes, but at 1:00, 1:10, 1:20, etc.), and then at startup (only) use the question timestamps and assume that anything since the last scheduled check time was probably not posted by our previous incarnation, so we should post it.
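
A rough sketch of that hybrid, under some assumptions: entries are dicts with an `"id"` and a timezone-aware `"published"` datetime, checks are scheduled on ten-minute boundaries in UTC, and the `SeenTracker` name is made up for this example.

```python
from datetime import datetime, timedelta, timezone


def last_scheduled_tick(now):
    """Round `now` down to the previous :00, :10, :20, ... boundary."""
    return now.replace(minute=now.minute - now.minute % 10, second=0, microsecond=0)


class SeenTracker:
    """Remember which question ids have been handled, in memory only."""

    def __init__(self, initial_entries, now):
        tick = last_scheduled_tick(now)
        # At startup (only), trust timestamps: anything published at or
        # before the last scheduled check is assumed to have been posted
        # by the previous incarnation of the bot.
        self.seen = {e["id"] for e in initial_entries if e["published"] <= tick}

    def new_entries(self, entries):
        """Return entries not seen before, and mark them as seen."""
        fresh = [e for e in entries if e["id"] not in self.seen]
        self.seen.update(e["id"] for e in fresh)
        return fresh
```

After startup, each scheduled run would call `new_entries` on the freshly fetched feed and post whatever comes back, so clock accuracy only matters for the single startup comparison.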

Or, the other obvious approach would be to keep a record of what questions have already been posted. Heroku makes it very easy to attach a database to a project, and at the scale we need it's free (less than 10k rows). This does add moderate complexity though, since we have to pick a postgres client (async or sync?), set up a simple schema, remind ourselves how to do SQL, and at least think about garbage collection. It's still pretty simple though, and TBH it might score better overall than my Very Clever solution. Sigh.
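
For a sense of how small the database version could be, here is a sketch that uses the standard-library `sqlite3` module as a stand-in for Postgres, so that it runs without any setup. The table name, column names, and the `claim`/`prune` helpers are all invented for illustration; on Postgres the insert would presumably use `ON CONFLICT DO NOTHING` instead of `INSERT OR IGNORE`.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the single table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posted ("
        " question_id TEXT PRIMARY KEY,"
        " posted_at REAL NOT NULL)"
    )
    return conn


def claim(conn, question_id, now):
    """Record the question; return True only the first time it is seen."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO posted (question_id, posted_at) VALUES (?, ?)",
            (question_id, now),
        )
    return cur.rowcount == 1


def prune(conn, now, max_age=30 * 24 * 3600):
    """Garbage-collect rows older than `max_age` seconds."""
    with conn:
        conn.execute("DELETE FROM posted WHERE posted_at < ?", (now - max_age,))
```

The bot would post a question only when `claim` returns True. Note that this records the question before it is posted, so a crash between the two steps would lose that question; recording after posting would risk a duplicate instead.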
