Comments (24)
What instance sizes are you using? If it is anything in the t* line then I could see the CPU bursting being a problem, but if it is a c4.large I'd be really surprised and haven't seen that happen. I can adjust the idle threshold to make it a bit more generous, but logs would really help. Maybe I could measure the CPU utilization before launching the browser to get an "idle" floor, then wait for the system to return to that level after the launch.
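A rough sketch of that idle-floor idea (hypothetical, not wptagent code: it derives busy CPU from /proc/stat deltas, so it assumes Linux, and the +10% margin, 1-second windows, and 30-try cap are arbitrary):

```shell
# Hypothetical sketch, not wptagent code: record a CPU "idle floor" before
# launching the browser, then wait for utilization to settle back near it.
cpu_busy_pct() {
    # First line of /proc/stat: cpu user nice system idle ...
    read -r _ user nice system idle _ < /proc/stat
    t1=$((user + nice + system + idle)); i1=$idle
    sleep 1
    read -r _ user nice system idle _ < /proc/stat
    t2=$((user + nice + system + idle)); i2=$idle
    # busy% over the 1s window (ignores iowait/irq; good enough for a floor)
    echo $(( 100 * ((t2 - t1) - (i2 - i1)) / (t2 - t1) ))
}

floor=$(cpu_busy_pct)   # measured before the browser is launched
# ... launch the browser here ...
tries=0
while [ "$(cpu_busy_pct)" -gt $((floor + 10)) ] && [ "$tries" -lt 30 ]; do
    tries=$((tries + 1))   # give up after ~30s if it never settles
done
```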
With wptagent if you run a test with debug=1 (if the server has been updated in the last few months) the agent will upload the full debug log (downloadable to the left of the waterfall). If you visit the home page with ?debug=1 it should also set a hidden form field to pass the option through to the test if you are running them manually.
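For anyone submitting tests over the HTTP API directly, the same option should be passable on runtest.php (a hedged sketch: the hostname and API key are placeholders, and I'm assuming debug=1 rides along like any other test option; the request itself is guarded so the snippet runs offline):

```shell
# Placeholder server and API key; debug=1 is assumed to pass through
# runtest.php like other test options. Guarded so it runs without network.
URL='http://wpt.example.org/runtest.php?f=json&k=API_KEY&debug=1&url=https%3A%2F%2Fexample.org%2F'
if [ "${RUN_WPT:-0}" = 1 ]; then
    curl -s "$URL"
fi
```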
There have been a few other reports of the wptagent agents running significantly slower than expected so tracking it down would be awesome.
from webpagetest.agent.
Yes we do c4.large, I did some testing in the beginning for Chrome for Linux too, and it seems that we need to have that size.
Ah cool, we use the WebPageTest API wrapper, so I need to add support for the debug flag there. I can fix that tomorrow (if it isn't there already). I'll turn on the debug log then and run the same tests again.
I got the logging to work. This test takes a really long time to run (http://wpt.wmftest.org/result/170919_48_8S/) but in the log it looks OK. We don't log anything before the actual browser start, so we need to change that.
Since the log is turned on and uploaded with the test run, it can only cover the window from when the agent got the job until it goes to upload the result. Outside of that there should just be the polling for work and the uploading of the actual result.
There should be a test timing log file in the server directory for the test, with timestamps for when the job was fetched from the server and when the result was uploaded, which should help narrow it down a bit more.
You can also grep your access log for the i-0a9209caf6a484ac6 instance ID string to see how often it is polling for work and the times between getting work and calling workdone.
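Something along these lines (the log path and instance ID here are illustrative, and the awk fields assume a common Apache combined-log layout):

```shell
# Print timestamp + request path for one agent's getwork/workdone calls so
# gaps between polls stand out. LOG and the instance ID are examples.
LOG=${LOG:-/var/log/apache2/access.log}
[ -r "$LOG" ] && awk '/i-0a9209caf6a484ac6/ && (/getwork/ || /workdone/) {print $4, $7}' "$LOG" || true
```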
The agent should be running inside of a script that is basically identical to the GCE one here. At startup and once every 24 hours it will install OS, browser and node updates and hourly it will pull the agent code from git. One of those might explain a long time between polling for work but not if the time comes between when the test started and when it was marked as done.
Hi Pat, I wanted to verify that you haven't seen this issue since you switched to Linux agents? We wanna switch too, and I was thinking it should be safe now since most(?) of wpt.org has moved to the Linux version?
Best
Peter
I haven't seen it, but I also haven't measured it carefully enough to be positive (certainly not at the scale of the initial post though). Beyond the public instances, the HTTP Archive has been running Linux agents for quite a while and the test throughput has been consistent.
Ok thanks, let us try it again then :)
Hi Pat, yesterday we pushed WPTAgent alongside the old Windows version. We run the exact same tests, except that Windows also runs IE tests (11 more test URLs). We still have the same problems as before. We use the AWS image for WPTAgent (ami-68dd6812) on a c4.large.
The Windows runs take 70 min to finish; the WPTAgent runs time out after 120 min (we abort a job after 120 min). We push X jobs first, wait for them to finish, and then push Y jobs. When I look in the job log I can see this:
02:22:08 Batch test URLs/scripts from ./scripts/batch/mobile.txt
02:22:08 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for https://en.m.wikipedia.org/wiki/Facebook
02:22:08 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for https://en.m.wikipedia.org/wiki/Barack_Obama
02:22:08 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for https://en.m.wikipedia.org/wiki/Sweden
02:22:08 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for https://en.m.wikipedia.beta.wmflabs.org/wiki/Barack_Obama
03:21:44 Tested URL https://en.m.wikipedia.org/wiki/Facebook for us-east-1-linux:Chrome with connectivity 3GFast result: http://wpt.wmftest.org/results.php?test=180201_7V_35&medianRun=fastest&medianMetric=SpeedIndex
That first batch is the first set of tests we send to the server (the server is up and running; we never take it down). There's a gap of about one hour before the first job comes back. Then the other jobs continue (until the timeout). The full log looks like this: https://gist.github.com/soulgalore/d21b7735a99085a8b2c2e4f24f30ae33
But there are also other variants where the delay happens between runs. Here we get a 500 error after 50 minutes (after a number of successful runs first):
01:11:57 Tested URL https://en.wikipedia.org/wiki/Sweden for us-east-1-linux:Firefox with connectivity Cable result: http://wpt.wmftest.org/results.php?test=180201_PG_10&medianRun=fastest&medianMetric=SpeedIndex
01:11:57 Tested using Firefox 58.0.1 on Linux 4.4.0-1039-aws
01:11:57 We got 16 working tests and 0 failing tests
01:11:57 Succesfully sent metrics.
01:11:57 Succesfully sent metrics.
01:11:58 Batch test URLs/scripts from ./scripts/batch/login-mobile.txt
01:11:58 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for ./scripts/wptscripts/login-mobile-enwiki-facebook.txt
01:11:58 Add job to the WebPageTest server [us-east-1-linux:Chrome 5 time(s)] for ./scripts/wptscripts/login-mobile-enwiki-sweden.txt
01:25:33 Tested URL https://en.m.wikipedia.org/wiki/Sweden for us-east-1-linux:Chrome with connectivity custom result: http://wpt.wmftest.org/results.php?test=180201_TH_1X&medianRun=fastest&medianMetric=SpeedIndex
01:25:33 Un-mapped entry in devtools timeline metrics:ScriptWrappableVisitor::performLazyCleanup
01:25:33 Un-mapped entry in devtools timeline metrics:ResourceChangePriority
01:25:33 Un-mapped entry in devtools timeline metrics:PlatformResourceSendRequest
01:25:33 Un-mapped entry in devtools timeline metrics:FireIdleCallback
01:25:33 Un-mapped entry in devtools timeline metrics:ScriptWrappableVisitor::performLazyCleanup
01:25:33 Un-mapped entry in devtools timeline metrics:ResourceChangePriority
01:25:33 Un-mapped entry in devtools timeline metrics:PlatformResourceSendRequest
01:25:33 Un-mapped entry in devtools timeline metrics:FireIdleCallback
01:25:33 Tested using Chrome 64.0.3282.119 on unknown
01:25:34 Succesfully sent metrics.
01:25:34 Succesfully sent metrics.
01:25:34 Succesfully sent metrics.
02:15:57 Failing test: { Error
02:15:57 at Object.<anonymous> (/srv/jenkins/workspace/workspace/performance-webpagetest-linux-wmf/node_modules/webpagetest/lib/helper.js:213:25)
02:15:57 at Module._compile (module.js:570:32)
02:15:57 at Object.Module._extensions..js (module.js:579:10)
02:15:57 at Module.load (module.js:487:32)
02:15:57 at tryModuleLoad (module.js:446:12)
02:15:57 at Function.Module._load (module.js:438:3)
02:15:57 at Module.require (module.js:497:17)
02:15:57 at require (internal/module.js:20:19)
02:15:57 at Object.<anonymous> (/srv/jenkins/workspace/workspace/performance-webpagetest-linux-wmf/node_modules/webpagetest/lib/webpagetest.js:14:15)
02:15:57 at Module._compile (module.js:570:32)
02:15:57 at Object.Module._extensions..js (module.js:579:10)
02:15:57 at Module.load (module.js:487:32)
02:15:57 at tryModuleLoad (module.js:446:12)
02:15:57 at Function.Module._load (module.js:438:3)
02:15:57 at Module.require (module.js:497:17)
02:15:57 at require (internal/module.js:20:19)
02:15:57 name: 'WPTAPIError',
02:15:57 code: 500,
02:15:57 message: 'Internal Server Error' }
So I don't think the wait time is always tied to the first run of the batch on the server.
Best
Peter
The 500 is a server-level error. Any chance the server is getting heavily loaded while processing the Linux tests and that is why it starts taking longer? There have been a lot of optimizations (maybe even since the last official release) to have the server not process the raw results and instead do all the processing at the agent.
Any chance the server is getting heavily loaded while processing
No, I don't think so? The only load we put on it is from the NodeJS WPT API wrapper checking whether a test is finished. However, that 500 error happens in maybe 8/10 runs when I went through the log. When I check the CPU in CloudWatch it's been up to 50% and then back down. I'll have a closer look tomorrow, but I have a feeling that the wait time isn't because of high CPU (at least if it is reported right in the monitoring details); I need more data.
There could be a fair amount of load in actually running the tests (as results get posted back to the server), particularly if the server is post-processing the devtools data (and possibly traces) instead of using the pre-processed versions from the agents.
If there aren't a lot of agents though it may not be the source of issues. Just taking a stab in the dark since I haven't been able to reproduce it.
OK, I can see now in the AWS console that the agent has been shut down, but our job still waits for results to come back. I haven't changed anything in the config; could it be that something changed in how AWS instances are kept alive? We use the server to auto-spin-up one instance. For the old Windows agent that kept the same instance alive for months (since we keep adding tests to it), but maybe that's not the best way to do it.
:-/ I know there have been relatively recent changes (3 months or so) to the auto-scaling logic but I don't use it myself. I'd expect the Windows and Linux instances to behave similarly though.
FWIW, if you just want to keep an instance running, I've found that using a persistent spot request works really well.
Yes, finally "it just works" :) I've deployed a standalone AWS instance (using your AMI) and everything runs clean. I wanna keep this open over the weekend though so I can see that it works with our Jenkins instance.
I've found that using a persistent spot request works really well.
So you make one request with a long valid-until time? And do you then push one test at a time so you are sure the tests will run, or what's the best practice?
Do you usually keep one server running (min and max), or do you let them burst up as needed to handle the testing? If you just keep one running, then a long-lived persistent spot request basically does the same thing. I usually just submit all of the tests at once and let the queueing handle it, but that also assumes you have enough capacity running to handle the testing volume you need (at the intervals you need it).
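A minimal sketch of the persistent-spot-request approach with the AWS CLI (the AMI and instance type are the ones from this thread; the server hostname, key, and file path are placeholders, and the actual request is guarded since it needs AWS credentials):

```shell
# Build a launch specification and file a persistent spot request.
# Placeholders: wpt.example.org and KEY. The aws call is guarded so the
# sketch won't fire without working credentials.
USER_DATA=$(printf 'wpt_server=wpt.example.org wpt_loc=us-east-1-linux wpt_key=KEY' | base64 | tr -d '\n')
cat > /tmp/wpt-spot-spec.json <<EOF
{
  "ImageId": "ami-68dd6812",
  "InstanceType": "c4.large",
  "UserData": "$USER_DATA"
}
EOF
if command -v aws >/dev/null 2>&1 && aws sts get-caller-identity >/dev/null 2>&1; then
    # --type persistent keeps the request alive: if the instance is
    # interrupted, EC2 launches a replacement automatically.
    aws ec2 request-spot-instances \
        --type persistent \
        --instance-count 1 \
        --launch-specification file:///tmp/wpt-spot-spec.json
fi
```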
One thing I forgot is that the Linux AMIs can spin up and down more frequently than hourly (since they are billed per minute), and maybe that's why they were getting shut down differently.
Ok, thanks!
Hmm, actually the agent was shut down again. Could it be that even though the instance wasn't started by the WebPageTest server (I created it manually), it still gets killed based on EC2.IdleTerminateMinutes?
At the moment we're running both the Windows and the new Linux agent side by side, so I wanted to change the config as little as possible; once we only run Linux, I'll just disable the auto EC2 handling.
One thing I forgot is that the Linux AMI's can spin up and down more frequently than hourly (since they are billed per minute) and maybe that's why they were getting shut down differently.
Yes, but I'm not sure, because it shut down while it had work in the queue, and that doesn't add up for me.
Yes, if the AMI ID matches one of the locations it knows about, then it will kill other instances even if you launched them manually. If you comment out the ami in locations.ini for the Linux location, it should stop trying to manage it. You can also launch an instance, stop it, and make a new AMI (basically copy the AMI) to a new ID.
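Roughly like this (a sketch against a typical locations.ini; the key names other than ami are illustrative and should match whatever is already in your file):

```ini
; us-east-1-linux location with the ami line commented out, so the server's
; EC2 scaling logic no longer claims (or terminates) instances from that AMI.
[us-east-1-linux]
label="US East (Linux)"
browser=Chrome,Firefox
region=us-east-1
; ami=ami-68dd6812
```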
@josephscott knows the auto-scaling behavior better than I do and may be able to help.
If the problem is the WPT server killing off EC2 instances, is there anything in the daily ec2/log/ directory? For example - ec2/log/ec2.log.20180203
When I was tracking down oddities with EC2 agents going up/down, I found the log info in there helpful.
Hi @josephscott, ah thanks, I forgot to look there once I actually understood that it was the EC2 up/down behavior causing the problem.
I can see a pattern. Without any specific configuration changes (meaning EC2.IdleTerminateMinutes was not configured), it almost always looks like this:
17:55:13 - Instance i-0609ab2530ee3357a started: c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
18:50:12 - Terminated instance i-0609ab2530ee3357a in us-east-1
18:50:12 - Launching c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
18:50:14 - Instance i-086dde645760c7cda started: c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
19:45:12 - Terminated instance i-086dde645760c7cda in us-east-1
19:45:12 - Launching c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
19:45:13 - Instance i-0df8d4d8bd48f285e started: c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
20:40:12 - Terminated instance i-0df8d4d8bd48f285e in us-east-1
20:53:43 - Launching c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=wpt.wmftest.org wpt_loc=us-east-1-linux wpt_key=KEY
20:53:45 - Instance i-06945243e08735cb6 started: c4.large ami ami-68dd6812 in us-east-1 for us-east-1-linux with user data: wpt_server=IP wpt_loc=us-east-1-linux wpt_key=KEY
It almost always tears the instance down after 55 minutes, and it mostly creates a new one, but that happens even when there are jobs in the queue for that instance. And the new instance doesn't come up in time to get any of the jobs. In plain desperation I added EC2.IdleTerminateMinutes and set it to 5 hours, and it worked ... for five hours :) Right now I have disabled the EC2 keys to make sure the instance(s) live forever.
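For reference, the two settings.ini knobs in play here, as I understand them (a sketch; the key names follow this thread's naming and may differ slightly in your settings.ini, and the values are examples):

```ini
; Give agents a longer idle window before the server tears them down ...
EC2.IdleTerminateMinutes=300

; ... or comment the EC2 credentials out entirely so the server stops
; starting/terminating instances altogether (what we ended up doing).
;EC2.key=AKIA...
;EC2.secret=...
```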
I can have a look at it late next week. Now that it finally works for us (we wanna move to Linux ASAP), I want to focus on getting everything ready on our side (we don't use the scaling daily, only when we test configuration changes on our side).
Just checking in to see if this has been resolved with the recent changes to the EC2 autoscaling logic.
We switched it off and I haven't turned it on again, sorry. It works fine for us to run without the scaling logic.
Thanks. Closing it out as there have been a few other threads on the auto-scaling and I think most of the rough edges are fixed (though it's still hard to get just right).