Comments (22)
@maltoe I like this warmup solution. Seems like the best compromise. 👍
I've added it and will report back after a few days too.
from chromic_pdf.
Hi @maltoe, thanks for following up. It does still seem to be a problem for us. I haven't had time to circle back on this but I see that you released an update that could help me debug so I'll see if I can get you the debugging info you requested.
from chromic_pdf.
@markquezada We're having the same issue. So far your warmup solution seems to be working after a few builds. I'll let you know if it still looks good after a few days.
from chromic_pdf.
@markquezada Are you running a lot of PDF-printing tests in parallel? Also, I haven't used github actions much myself, does it change anything on the test container when you ssh into it?
Also, since you mentioned discard_stderr
, I'm assuming you don't see any Chrome output in your logs?
Sorry, nothing comes to mind at first thought :/
from chromic_pdf.
Thanks for the reply! Sorry, just saw it.
I'm literally just running the one test shown above... no other PDF tests in parallel or anything like that. I just wanted to confirm it worked in CI and that it picked up the chromium package, etc.
I don't believe anything changes when you SSH in. I'm using the tmate
action in the step right before I'm running the tests and then logging in. The only thing I could think of was that there was some sort of boot lag for chrome that I unintentionally skirted by logging in and running the tests manually. But that's a total shot in the dark.
I double checked the logs and I did notice these which I hadn't noticed before:
[0902/080732.519703:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
[0902/080732.877365:WARNING:sandbox_linux.cc(376)] InitializeSandbox() called with multiple threads in process gpu-process.
[0902/080738.291913:ERROR:devtools_pipe_handler.cc(149)] Connection terminated while reading from pipe
[0902/080738.302286:ERROR:devtools_pipe_handler.cc(234)] Could not write into pipe
[0902/080738.302408:ERROR:devtools_pipe_handler.cc(234)] Could not write into pipe
I didn't notice them before because this output wasn't near the test itself, but rather intermingled in the test output. Does that help at all? The connection terminating while reading from pipe sounds... significant 😅
The sandbox one is odd since I was running with no_sandbox: true.
Maybe a tangent, but I was taking a look at the chrome docs and I thought these flags looked interesting. Not sure if they're relevant:
--disable-software-rasterizer --disable-dev-shm-usage --disable-setuid-sandbox
For completeness and in case it's relevant, I did want to note I have this in my test_helper.exs
:
ExUnit.after_suite(fn _ -> Supervisor.stop(MyApp.Supervisor) end)
I saw it somewhere in the docs related to not properly shutting down.
from chromic_pdf.
Hey @markquezada
- I wouldn't worry about the
InitializeSandbox
, it probably spits that out before it pays attention to the--no-sandbox
flag. In the past if we or someone else had issues with the sandbox, it would generally either just crash with a permission error, or not. So that should be fine. - The
Could not write into pipe
bits definitely stand out, don't they? :) I would assume though that these come from after your test suite has ran, and has closed the pipe from Erlang's side. So more a symptom than the cause of the problem.
But I'm also just guessing, to be honest. It sure looks like some kind of race condition at boot time.
Further debugging ideas / questions:
- Does it help if you sleep 10s in your test before printing the PDF?
- How do you start your supervision tree in the test?
- I'd be curious to see which protocol is being executed when you hit the exception -> would it be possible for you to compile a version that inspects the
protocol
variable here?
from chromic_pdf.
Hi @markquezada
Can we close this? Any news?
from chromic_pdf.
Quick update: I added a sleep call in the test and so far none of our builds have failed. I'm going to let that run for another day or so and then I'll remove it to see if I can trigger the same error I was getting previously.
from chromic_pdf.
@maltoe FYI, I've been slowly reducing the sleep time when printing in test and now it's back at 0 and I'm getting this occasionally:
1) test it renders a pdf blob (MyApp.TranscriptPdfRendererTest)
Error: test/my_app/transcript_pdf_renderer_test.exs:12
** (exit) exited in: NimblePool.checkout(#PID<0.663.0>)
** (EXIT) time out
code: assert {:ok, _} = TranscriptPDFRenderer.render(transcript)
stacktrace:
(nimble_pool 0.2.6) lib/nimble_pool.ex:402: NimblePool.exit!/3
Warning: (chromic_pdf 1.3.1) lib/chromic_pdf/api.ex:50: anonymous fn/3 in ChromicPDF.API.chrome_export/4
(chromic_pdf 1.3.1) lib/chromic_pdf/api/telemetry.ex:10: anonymous fn/2 in ChromicPDF.Telemetry.with_telemetry/3
Warning: (telemetry 1.1.0) /home/runner/work/my_app/my_app/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
(my_app 0.1.0) lib/my_app/transcript_pdf_renderer.ex:21: MyApp.TranscriptPDFRenderer.render/1
Warning: test/my_app/transcript_pdf_renderer_test.exs:20: (test)
I'm happy to simply keep the call to sleep
in and call it a day but I thought you might want to see this error.
from chromic_pdf.
Yeah, that one we could probably address. You're still using the session_pool: [timeout: 30_000]
option, is that right? That isn't applied to the NimblePool.checkout!/3
call, see here. Mostly because I didn't bother and so far nobody complained, but also because it's a bit more tedious as I can't keep it in the pool state.
A bit of context: The :timeout
option controls the time that the actual PDF printing process is allowed to take, namely the time it takes Chrome to load your assets, render the page, create a PDF from it, and return it. It is applied here on Channel.run_protocol/3
which internally calls GenServer.call/2
. So that's why we're telling users who have long PDFs to set this option. However, it does not prevent you from maxing out the concurrency of the worker pool, i.e. by trying to print more PDFs when all your workers are already busy. It's this tiny statement in the docs:
Please note that these are non-queueing worker pools. If you intend to max them out, you will need a job queue as well.
So for you to fix this: Use a queue like Oban to ensure limiting your concurrent print_to_pdf
calls.
@andreasknoepfle Regarding "that one we could probably address" -> I think we should rescue this error and explain this better. I'm unsure about adding a checkout_timeout
option somewhere to control NimblePool, as an option like this would only mitigate the problem for users, but in a fragile way? Not sure if we want to support that.
from chromic_pdf.
@markquezada addendum on my previous comment:
@andreasknoepfle pointed out that you were only running a single test on CI, which I completely forgot in the meantime :) So I guess it's not really plausible that you max out the concurrency of the pool?
In that case I'd offer the following theory:
- When ChromicPDF is started as part of your supervision tree, occasionally it takes the worker init (= the
SpawnSession
protocol, see theSessionPool
module) a long time - You have set
init_timeout
option to something greater than 5 seconds - Your test runs meanwhile, tries to print something, and runs into the checkout timeout
I could simulate this by reducing the checkout timeout to 1ms and running this on the console:
ChromicPDF.start_link(); ChromicPDF.print_to_pdf({:html, "test'})
from chromic_pdf.
FWIW I tried setting up Github actions for this repository, and look at the first error that jumps in my face:
[edit] or look at this job: https://github.com/bitcrowd/chromic_pdf/actions/runs/3600997571/jobs/6066285321 all sorts of EXITs and timeouts
from chromic_pdf.
This one may be relevant microsoft/playwright#4345
from chromic_pdf.
Ah, interesting. Yes, I'm only running the one test so the nimble pool error seemed suspect.
Your theory sounds pretty spot on to me.
FYI, here's my current config in test:
config :my_app, ChromicPDF,
on_demand: false,
offline: true,
discard_stderr: false,
no_sandbox: true,
chrome_args: "--disable-dev-shm-usage",
session_pool: [init_timeout: 30_000, timeout: 30_000]
I even tried making Chromic the first thing in my supervisor tree so that it boots up first but I still get this error occasionally.
This one may be relevant microsoft/playwright#4345
Isn't that running headful chrome though? That ticket notes that headless chromimum is optimized for cold boots.
One thing I just realized... I'm not specifying what chromium binary to use and I'm using the ubuntu-latest
github actions runner. Is it possible Chromic is picking a headful version of chromium to run against and that is what's causing the long boot time?
The software list for the ubuntu runner lists:
- Google Chrome 107.0.5304.121
- ChromeDriver 107.0.5304.62
- Chromium 107.0.5288.0
... so headful chrome looks to be installed. Based on the way the binary is located in chrome_impl.ex, it doesn't seem like it would choose headful chrome first, but I haven't verified which binary is getting run. Just a shot in the dark.
Thanks for being receptive to this issue!
from chromic_pdf.
This one may be relevant microsoft/playwright#4345
Isn't that running headful chrome though? That ticket notes that headless chromimum is optimized for cold boots.
Huh? AFAIK there is no dedicated Chrome build for "headless only" operation. chrome --headless
spawns chrome without its UI with any chrome binary.
I thought the ticket was related as it explicitly mentioned random delays on Github actions, but I have since experimented with that by measuring the time it takes ChromeImpl
to start the port and it seemed pretty consistent.
To me it feels like there is something fishy on Github Actions. Chrome may be doing some work on start up that may delay it becoming responsive to devtools commands. Perhaps it's loading something from disk, or writing its profile, or the like; and for some reason on Github Actions that occasionally takes a long time. But that's really just a gutfeel, I'd like to investigate further, especially to find out where exactly it stalls the SpawnSession
protocol. Will do that when I find the time, see the #179 branch.
Thanks for being receptive to this issue!
You're welcome. Sorry that it has not been the greatest experience for you yet. ChromicPDF usually runs pretty stable on production, but we and other people had a lot of issues in CI already, as quickly launching and stopping Chrome instances seems to be something that is difficult to get to work reliably.
from chromic_pdf.
@markquezada Still on my way of getting the test suite to run on Github actions (see #179), but it's looking good, i.e. I don't have random delays anymore. Maybe give 1.4.0 a try and see if one of the command line options I copied over from Puppeteer makes a difference. Cargo-cult programming at its finest.
[EDIT] 😠 Nevermind, just got an init timeout. Forget what I said. Still in the complete dark about what's wrong here.
from chromic_pdf.
A bit funny... we have dependabot set up to auto-create PRs when dependencies update. I was excited when I saw it come in with a ChromIcPDF update. But then CI failed on the PR 😭
Probably not that helpful, but for reference, here's the relevant messages in the logs:
[1208/122123.396777:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
[1208/122123.934858:WARNING:sandbox_linux.cc(380)] InitializeSandbox() called with multiple threads in process gpu-process.
1) test it renders a pdf blob (Novo.TranscriptPdfRendererTest)
Error: test/novo/transcript_pdf_renderer_test.exs:7
** (ChromicPDF.Browser.ExecutionError) Caught EXIT signal from NimblePool.checkout!/4
** (EXIT) time out
This means that your operation was unable to acquire a worker from the pool
within 5000ms, as all workers are currently occupied.
Two scenarios where this may happen:
1) You suffer from this error at boot time. For instance, you're running tests
on CI and occasionally Chrome takes a long time to spawn, exceeding the
session pool's "init timeout". Hence, when you're trying to print a PDF
in a test, the pool does not have any initialized worker yet.
Unfortunately, there is currently very little you can do about this.
See issue #160 for a discussion.
2) You're experiencing this error randomly under load. This would indicate that
the number of concurrent print jobs exceeds the total number of workers in
the pool, so that all workers are occupied.
To fix this, you need to increase your resources, e.g. by increasing the number
of workers with the `session_pool: [size: ...]` option.
However, please be aware that while ChromicPDF (rather, the underlying
NimblePool worker pool) does perform simple queueing of worker
checkouts, it is not suitable as a proper job queue. If you expect to
print multiple PDFs concurrently and especially if you expect peaks in
your load, a job queue like Oban will provide a better experience.
Please also consult the worker pool section in the documentation.
code: assert {:ok, _} = TranscriptPDFRenderer.render(transcript)
stacktrace:
(chromic_pdf 1.4.0) lib/chromic_pdf/pdf/browser/session_pool.ex:70: ChromicPDF.Browser.SessionPool.run_protocol/3
Warning: (chromic_pdf 1.4.0) lib/chromic_pdf/api.ex:50: anonymous fn/3 in ChromicPDF.API.chrome_export/4
(chromic_pdf 1.4.0) lib/chromic_pdf/api/telemetry.ex:10: anonymous fn/2 in ChromicPDF.Telemetry.with_telemetry/3
Warning: (telemetry 1.1.0) /home/runner/work/novo/novo/deps/telemetry/src/telemetry.erl:320: :telemetry.span/3
(novo 0.1.0) lib/novo/transcript_pdf_renderer.ex:21: Novo.TranscriptPDFRenderer.render/1
Warning: test/novo/transcript_pdf_renderer_test.exs:15: (test)
[1208/122127.815462:ERROR:devtools_pipe_handler.cc(148)] Connection terminated while reading from pipe
[1208/122127.850429:ERROR:devtools_pipe_handler.cc(233)] Could not write into pipe
[1208/122127.850453:ERROR:devtools_pipe_handler.cc(233)] Could not write into pipe
from chromic_pdf.
I've looked into Github actions further today and established a few points:
- The delay happens after
Port.open
, Chrome launches but takes a while to respond to our devtools messages. - The delay only happens on first launch of the executable. Subsequent runs are in the 100ms region.
- The delay only happens sometimes. On some runs Chrome took just a second to respond. On others, it took between 10 and 15 seconds, which lead to the timeout.
-> It looks a lot like Chrome is initializing something on first run, perhaps some cache or so, and perhaps including writing to disk. And only sometimes Github Actions has trouble providing the required resources promptly. Perhaps some part of the filesystem is provisioned lazily. All just guesses though.
Solutions
- we could set the default timeouts (
@checkout_timeout
and@init_timeout
) to something beyond ~20s. However, I'd prefer not to, as this also means that in normal operation we would wait for 20s for every checkout. In normal operation however, I consider a checkout error to indicate that the user should up their game, i.e. add a proper queue or more resources (see the error message). Thus IMO the default timeout should be low (perhaps even lower than 5s, maybe 1s) to make people aware early. Personal opinion though, yours may vary. - The solution I've pursued today is a manual "warmup" step to be added to your CI pipeline, to make sure Chrome has initialized all its things.
Could you do me a favor (hopefully the last) and try out #193 in your pipeline and add the warmup/0
call to your test_helper.exs
? 🙏
from chromic_pdf.
@maltoe Warmup has been working for a week now. I'd say we're good!
from chromic_pdf.
Merged. Heads up, I renamed it to warm_up/0
. Also added a mix task that can be run directly from a CI script, in case you don't need any options.
Curiously, it has started happening on our own CI now 🎉 Maybe it's something that has changed in recent Chromes (I've updated the Chrome version in the CI maybe a month ago)...
Thanks for your help everyone! @markquezada, I'm going to interpret your silence as "it hasn't happened again" and going to close this issue for now. Feel free to reopen if necessary.
from chromic_pdf.
Released as v1.5.0
from chromic_pdf.
@maltoe sorry, just saw this as I've been out on vacation. Yep, it all still seems to be working well with the warmup option for us!
Thanks for your help and persistence in tracking this down and providing a fix!
from chromic_pdf.
Related Issues (20)
- Flaky test assertion on log output HOT 3
- Support dynamic names HOT 2
- Investigating the ability to navigate through POST HOT 4
- on_demand implementation sends exit signal to caller HOT 15
- GPU renderer on a headless server? HOT 2
- Timeouts on prod HOT 11
- Exception at startup on Gigalixir HOT 2
- Chromium on a different machine in the network HOT 9
- Ability to pass URL as the template source? HOT 1
- Ability to call `Emulation.setScriptExecutionDisabled` on a per-page basis? HOT 3
- Is it possible to use the service name instead of an ip address when use chrome_address? HOT 1
- Inspector.targetCrashed received when `no_sandbox: true` on Chromium 115 on Linux HOT 7
- Make checkout timeout configurable
- Changelog dates are wrong HOT 1
- websocket_debugger_url failing on windows erlang26/elixir15.7 HOT 2
- Footer is not displayed anymore with recent Chromiums HOT 23
- Troublesome PNG images? HOT 2
- Printing PDF crashes instantly before timeout HOT 13
- Loading external files fails when printing from in-memory HTML HOT 2
- Thoughts regarding executing ghostscript with stderr_to_stdout: true? HOT 8
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from chromic_pdf.