trio-gitter-bot's People

Contributors

mariatta, njsmith, ratanshreshtha

trio-gitter-bot's Issues

logic for which posts to send seems inverted somehow

So someone finally posted a python-trio question on SO this evening:

https://stackoverflow.com/questions/57742649/asks-async-https-client-ssleoferror-on-connect-attempt-to-server-with-self-s

and the bot noticed and sprang into action! It posted 29 links to the chat, none of which are the new question:

https://gitter.im/python-trio/general?at=5d6b21c929dba2421ce0bc11

The gitter timestamps are within 5 min of the question being posted.

This is kind of hilarious, but also very confusing. 🤣😕⁉️

It's almost like the logic is:

IF there's a question posted that was posted in the last 10 min, THEN post all the questions that weren't posted in the last ten minutes

Maybe RSSReader.read(newer_than=...) does not work the way one would expect?
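
For reference, here is a minimal sketch of the filter one would expect a `newer_than` argument to apply. This is not the bot's actual code and says nothing about how `RSSReader.read` is really implemented; the helper names and the dict-with-`published` entry shape are made up for illustration. The second function shows a flipped comparison, which is one possible (unconfirmed) way to end up with old questions instead of new ones.

```python
from datetime import datetime, timedelta, timezone

TEN_MINUTES = timedelta(minutes=10)


def select_recent(entries, now, max_age=TEN_MINUTES):
    """Expected behaviour: keep entries published within the last `max_age`."""
    cutoff = now - max_age
    return [e for e in entries if e["published"] > cutoff]


def select_inverted(entries, now, max_age=TEN_MINUTES):
    """Hypothetical bug: the comparison is flipped, so only old entries pass."""
    cutoff = now - max_age
    return [e for e in entries if e["published"] < cutoff]


# Made-up sample data: one old question and one posted four minutes ago.
now = datetime(2019, 9, 1, 2, 0, tzinfo=timezone.utc)
entries = [
    {"title": "old question", "published": now - timedelta(days=3)},
    {"title": "new question", "published": now - timedelta(minutes=4)},
]

recent_titles = [e["title"] for e in select_recent(entries, now)]
inverted_titles = [e["title"] for e in select_inverted(entries, now)]
```

With the sample data, `recent_titles` contains only the new question and `inverted_titles` contains only the old one.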

more reliable algorithms for choosing which questions to post

I don't know if this matters at all. I suspect probably not. But it was an interesting puzzle that I got nerd-sniped by, so I figured I'd write down what I thought of :-)

The bot's goal is to post questions without any duplicates or missing any, i.e. exactly-once delivery. As we know from distributed systems theory, reliable exactly-once delivery is ludicrously difficult or impossible, so we want to pick some "good enough" heuristic that isn't too complicated to implement. So the challenge is to optimize that reliability/simplicity tradeoff.

Right now the bot uses an extremely simple heuristic: it polls for questions every ten minutes, and then it posts any questions that have timestamps newer than ten minutes. I love this because if you think of it from a distributed systems perspective it's obviously too simple to work (schedulers aren't accurate! We're comparing clocks across two completely separate platforms!), but in fact it (probably) works very well and it's so simple that it's actually hard to beat.

It does at least theoretically have flaws though: successive ten minute windows are never going to line up exactly; there will always be a gap or some overlap. I guess probably a few seconds every ten minutes. If a question happens to be posted during that time, then it will be either double-posted or lost entirely. Can this be avoided, without massively complicating the implementation?
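
To make the gap/overlap problem concrete, here is a small sketch under made-up numbers: times are plain seconds, each poll looks back exactly 600 seconds, and the scheduler fires slightly late once and slightly early once. The function name and the sample poll times are illustrative assumptions, not measurements from the bot.

```python
WINDOW = 600  # look-back window in seconds (ten minutes)


def polls_that_post(question_time, poll_times, window=WINDOW):
    """Return the poll times whose look-back window contains the question."""
    return [t for t in poll_times if t - window < question_time <= t]


# Hypothetical schedule: second poll is 3 s late, third poll is 5 s early.
poll_times = [0, 603, 1198]

lost = polls_that_post(2, poll_times)       # falls in the gap after poll 0
normal = polls_that_post(300, poll_times)   # seen by exactly one poll
doubled = polls_that_post(600, poll_times)  # falls in the overlap
```

Here `lost` is empty, `normal` has one entry, and `doubled` has two, which correspond to a missed question, a correctly posted one, and a double-posted one.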

I did have one clever idea: when we start up, fetch the feed and store a list in-memory of which questions are already there. Then on future iterations, fetch the list, and post all the questions that aren't in our in-memory list, and add them to the list. That totally removes the dependency on accurate clocks, and is still super simple. It does have the downside that if a question is posted a few minutes before the bot is restarted, it might get lost entirely. I guess we could try to minimize that chance by using a hybrid of the two approaches: schedule our regular checks at a known absolute time (like: not just every ten minutes, but at 1:00, 1:10, 1:20, etc.), and then at startup (only) use the question timestamps and assume that anything since the last scheduled check time was probably not posted by our previous incarnation, so we should post it.
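
A rough sketch of that hybrid idea, assuming the feed can be reduced to `(question_id, published)` pairs with timezone-aware timestamps. The `QuestionTracker` class and `last_scheduled_check` helper are hypothetical names invented for this example; fetching the feed and posting to chat are left out.

```python
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_scheduled_check(now, interval=INTERVAL):
    """Most recent absolute slot (1:00, 1:10, 1:20, ...) at or before `now`."""
    slots = (now - EPOCH) // interval  # whole intervals elapsed since the epoch
    return EPOCH + slots * interval


class QuestionTracker:
    """Remembers in memory which question ids have already been handled."""

    def __init__(self):
        self._seen = set()

    def start(self, feed, now):
        """At startup: remember everything, but return ids published since the
        last scheduled check, assuming the previous run never got to them."""
        since = last_scheduled_check(now)
        backlog = [qid for qid, published in feed if published > since]
        self._seen.update(qid for qid, _ in feed)
        return backlog

    def poll(self, feed):
        """On later checks: return ids not seen before; no clocks involved."""
        fresh = [qid for qid, _ in feed if qid not in self._seen]
        self._seen.update(fresh)
        return fresh


# Made-up example: the bot restarts at 01:17, between the 01:10 and 01:20 slots.
now = datetime(2019, 9, 1, 1, 17, tzinfo=timezone.utc)
feed = [
    ("q1", datetime(2019, 9, 1, 0, 50, tzinfo=timezone.utc)),
    ("q2", datetime(2019, 9, 1, 1, 12, tzinfo=timezone.utc)),
]
tracker = QuestionTracker()
at_startup = tracker.start(feed, now)

feed.append(("q3", datetime(2019, 9, 1, 1, 19, tzinfo=timezone.utc)))
first_poll = tracker.poll(feed)
second_poll = tracker.poll(feed)
```

In this run only `q2` is posted at startup, `q3` is posted on the next poll, and a repeat poll of the same feed posts nothing. The startup rule is still a guess: a question that the previous incarnation posted outside its schedule would be posted again.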

Or, the other obvious approach would be to keep a record of what questions have already been posted. Heroku makes it very easy to attach a database to a project, and at the scale we need it's free (less than 10k rows). This does add moderate complexity though, since we have to pick a postgres client (async or sync?), set up a simple schema, remind ourselves how to do SQL, and at least think about garbage collection. It's still pretty simple though, and TBH it might score better overall than my Very Clever solution. Sigh.
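
For a sense of how small the database version could be, here is a sketch that uses the standard-library `sqlite3` module purely as a stand-in for Postgres, so it runs without any setup. The table name, column names, and helper functions are all assumptions for illustration; with Postgres the same insert would be written as `INSERT ... ON CONFLICT DO NOTHING` through whichever client gets picked.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and create the one-table schema if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posted ("
        " question_id TEXT PRIMARY KEY,"
        " posted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def claim(conn, question_id):
    """Record the id; return True only if it was not recorded before."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO posted (question_id) VALUES (?)",
            (question_id,),
        )
    return cur.rowcount == 1


def prune(conn, keep_days=30):
    """Garbage-collect rows older than `keep_days` days."""
    with conn:
        conn.execute(
            "DELETE FROM posted WHERE posted_at < datetime('now', ?)",
            (f"-{keep_days} days",),
        )


conn = open_store()
first = claim(conn, "57742649")   # new id, so the bot should post it
second = claim(conn, "57742649")  # already recorded, so skip it
prune(conn)
remaining = conn.execute("SELECT COUNT(*) FROM posted").fetchone()[0]
```

One design choice to be aware of: claiming the id before posting gives at-most-once behaviour, so if the chat post fails after the claim, that question is lost unless the row is removed again.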
