Giter Club home page Giter Club logo

Comments (12)

ankon avatar ankon commented on June 20, 2024 1

Are you using the DecisionFunc to allow only recognized domain names?

Yes we do. The problem is that in this case from our side we should serve the page, it just so happens that the customer has put a proxy in the middle that "eats" the challenge requests. This isn't per se painful (the request trigger some storage lookups, but those happen with long delays between them!), but it disturbs monitoring information :)

Ah, I think you might need a timeout on the connection itself (configured in the server).

Actually that's a good point. I need to verify that we actually do have one in place for the handshake!

from certmagic.

mholt avatar mholt commented on June 20, 2024 1

D'oh, sorry -- sometimes I forget which repo I'm in and that CertMagic != Caddy 🙃

Let me read through this flow and see if I come to the same conclusion. What you're saying sounds like it is probably right, if not very close, so let me just refresh my memory and I'll get back to this 👍

from certmagic.

ankon avatar ankon commented on June 20, 2024 1

Intuitively: The waiters should return the result of what they are waiting for. In this case the result is either a certificate ready to use, or a reason why that wasn't possible to provide (the error).

That the communication to get these results happens via the cache as short-term storage is just an implementation detail. You could conceptually also use a "singleflight" (https://pkg.go.dev/golang.org/x/sync/singleflight), for instance.

from certmagic.

mholt avatar mholt commented on June 20, 2024

We absolute do find it useful! :)

Thank you! Glad you find it useful.

Unfortunately we do not control DNS for these websites

Make sure you're using on-demand TLS then (which it looks like you are). That's best for when you don't control the domains.

What actually seems to happen is that certmagic perpetually tries to renew the certificate, times out (expected), and then retries renewal.

I believe this is working as intended, actually!

I would expect an error: The certificate is expired, and shouldn't be served, nor is it in its "renewal window".

An expired certificate is considered "in the renewal window." A certificate that is still configured to be served will try to be renewed, ideally before it expires, but after if it must.

What this loop you're seeing is doing is trying to renew a certificate. When it sees that another goroutine (i.e. another TLS handshake) is already renewing it, it waits until it is done so it can use the cert it gets. That's the case <-wait: line you can see in renewDynamicCertificate(). Of course, we have no idea if it succeeded, so we proceed to load the certificate from storage (which will be a no-op if it already was by the other goroutine), and check its expiry. If it's still within the renewal window, then we know the other goroutine failed to renew it. Since we still need the cert renewed, we try again.

It does limit the retries for 2 minutes. It is arbitrary, yes, but do you have a suggestion for a better timeout? I have been trying to think of a better limit that makes more sense, but haven't come up with one. Hence the TODO in the code:

// TODO: see if we can get a proper context in here, for true cancellation
timeout := time.NewTimer(2 * time.Minute)

I'll happily use more sensible timeouts if we can figure those out.

And while I agree that expired certs shouldn't be served, ideally, I've found it's still better to serve the certificate as-is to not cover up what is actually happening. It just makes troubleshooting easier when it's obvious: "Oh, the cert is expired." Then you don't have to go digging for other problems, wondering why the server isn't serving any certificate.

See also: https://github.com/https-dev/docs/blob/master/acme-ops.md#if-an-active-certificate-cannot-be-renewed-in-time-act-according-to-your-sites-threat-model

Anyway, what you're seeing isn't (shouldn't be) an infinite loop because we do have a hard deadline set for these operations. Let me know if you confirm that you observe otherwise.

from certmagic.

ankon avatar ankon commented on June 20, 2024

have a hard deadline set for these operations

Well ... my handshake was taking ~14h, before I restarted the server :)

So, yeah, I see the hard deadlines, but what we also see is that they do not lead to certmagic actually aborting. "Something" here needs to stop the chain of calling loadCertFromStorage -> handshakeMaintenance -> loadCertFromStorage.

We do use a custom storage though, so thinking aloud: What if I were to check expiration, and refuse to return a cert that has already been expired?

from certmagic.

ankon avatar ankon commented on June 20, 2024

Looking at it the other way around: If I read your answer correctly, then the assumption is that in the case of an expired certificate we would either do a blocking refresh (taking ~up to 2min), or would wait for another thread to do a blocking refresh (taking presumably also up to ~2min), and then we should return an error from getCertDuringHandshake?)

from certmagic.

mholt avatar mholt commented on June 20, 2024

Ah, I think you might need a timeout on the connection itself (configured in the server).

Perhaps we should have a recursion deadline too; but at the same time, if a client is waiting for a handshake, why give up? AFAIK it doesn't hurt to retry to serve the client.

Looking at it the other way around: If I read your answer correctly, then the assumption is that in the case of an expired certificate we would either do a blocking refresh (taking ~up to 2min), or would wait for another thread to do a blocking refresh (taking presumably also up to ~2min), and then we should return an error from getCertDuringHandshake?), and then we should return an error from getCertDuringHandshake?)

Correct -- worth noting that ONLY expired certificates have blocking maintenance. Certs that haven't yet expired are renewed in the background and don't block handshakes, even though the handshake is what triggers the renewal.

Are you using the DecisionFunc to allow only recognized domain names? If you no longer serve a customer, for example, you should not allow the server to continue managing a cert for those names. (Unless scripts/bots are programmed with their own resolvers, domains that no longer point to your server should not be making handshakes that often.)

from certmagic.

mholt avatar mholt commented on June 20, 2024

Well ... my handshake was taking ~14h, before I restarted the server :)

Huh... that shouldn't be happening 🤔 We set timeouts whenever we start a renewal operation or wait for a renewal operation to complete. I wonder where that handshake is hanging at. Can you perhaps isolate its stack trace? The localhost:2019/debug/pprof endpoint has a link to get a goroutine dump, which can tell us where the code is at. This may be easier with fewer requests, in fact only one, to the problematic domain, that hangs for hours. Then I can know better where we may be missing a timeout.

from certmagic.

ankon avatar ankon commented on June 20, 2024

Note that this isn't caddy, it's a go server using certmagic -- we're still busy moving towards caddy.

Here's the trace from the screenshot:

                                                                               /src/database/dynamodb/certmagic.go (dynamodb.(*certmagicStorage).Load:157)  [this is our storage loading things]
                                                         /src/vendor/github.com/caddyserver/certmagic/crypto.go (certmagic.(*Config).loadCertResource:253)
                                                /src/vendor/github.com/caddyserver/certmagic/crypto.go (certmagic.(*Config).loadCertResourceAnyIssuer:182)
                                             /src/vendor/github.com/caddyserver/certmagic/certificates.go (certmagic.(*Config).loadManagedCertificate:146)
                                            /src/vendor/github.com/caddyserver/certmagic/certificates.go (certmagic.(*Config).CacheManagedCertificate:133)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:390)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
                                                  /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
                                                  /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
                                                  /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
                                                  /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
                                                  /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
                                                  /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
                                                  /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
                                                  /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)
                                                  /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).handshakeMaintenance:592)
                                                   /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).loadCertFromStorage:405)
                                               /src/vendor/github.com/caddyserver/certmagic/handshake.go (certmagic.(*Config).renewDynamicCertificate:650)

I need to verify that we actually do have one in place for the handshake!

Ok, did some stepping and further reading here.

From what I can see: The context passed in GetCertificate (via https://pkg.go.dev/crypto/tls#ClientHelloInfo.Context) does not have a timeout, but it internally does have a deadline set on both reading and writing (https://cs.opensource.google/go/go/+/refs/tags/go1.20.7:src/net/http/server.go;l=1871-1876, in our case because we have a read timeout set).
Now from my understanding that's "nice", but it doesn't help for GetCertificate, as that never reads or writes from the connection.

We do not explicitly add any further timeout into the context we pass to certmagic's GetCertificateWithContext.

I think what might be happening here requires additional requests: I can see in our logs that the attempt to renew the certificate fail after the first or second attempt ("will retry" messages with retrying_in of 60/120, attempts 1), with an error from Lets Encrypt (rate limited):
image

I also see the expected errors from the requesting place ("renewing and reloading certificate"), with varying reasons (context canceled directly or when in our storage doing I/O, ACME errors).

This will fail the renewal for the caller, so handshakeMaintenance returns (something), err, and I do see the errors from its caller loadCertFromStorage ("maintaining newly-loaded certificate"). If this call to loadCertFromStorage came from getCertDuringHandshake, which will now proceed with whatever they have (expired certificate, if I read correctly). The handshake will likely fail, and the HTTPS client will be unhappy and tell the user. That's fine.

BUT: If while we're attempting to renew other requests came in, they would now be queued up and waiting for the 2 minutes timer. The moment the initial handshakeMaintenance receives the error it will call unblockWaiters, and all these requests will now call loadCertFromStorage. This will fetch the expired certificate again, attempt the maintenance, and again have one request doing it (and failing), while the others will recurse and try again.

Am I reading this correctly?

from certmagic.

mholt avatar mholt commented on June 20, 2024

Ok, sorry that took me a minute. Yes, I think your explanation is accurate:

  1. First goroutine claims a lock for that name and tries to get the cert.
  2. Other goroutines that need that cert will see that first goroutine is already working on it, and wait.
  3. Once the first goroutine has either succeeded or failed, it will unblock the waiters.
  4. They will try loading the cert from storage.

Hmm, I think that is incorrect though. It probably should be:

  1. Once the waiters are unblocked, they simply use the certificate already loaded into the cache by the first goroutine.
  2. If the certificate is not in the cache, then something went wrong. Either they should repeat or just return the same error.

I guess it comes down to a question of... if one goroutine fails, should the others wait and retry or just give up?

I'm leaning toward they should give up and let the clients retry again if desired. Not sure why an immediate retry and recursion would actually be helpful here. (If you can think of something, feel free to let me know.)

So: should I change the code so that the waiters just unblock and use the result of the first goroutine? And if there isn't one, they also return an error?

from certmagic.

mholt avatar mholt commented on June 20, 2024

Gotcha. Ok, I'll work that into #249: the waiter will either return a certificate or an error, not recurse.

from certmagic.

mholt avatar mholt commented on June 20, 2024

I think this is fixed now based on my trials but we can reopen this if needed!

from certmagic.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.