Comments (15)
I'm going to see if I can get some tests around the test infrastructure itself so we can guard against this feature breaking. Our CI depends heavily on it to avoid generating noise (i.e. false positives).
from llvm-project.
This is one example of it:
https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-14_04/2753/
The symptom looks like this:
1. The test log will show the original failure (usually a timeout or a load-sensitive test failing, like TestConcurrentEvents.py or the lldb-mi ones).
2. The test will rerun in the trailing rerun phase.
3. The rerun will clearly pass.
4. The final summarization will continue to indicate there was a failure.
5. We will then fail the test run.
For this example, TestConcurrentEvents.py failed (step 1). It was rerun (step 2). It passed without any further timeout (step 3). The summarization still lists the TIMEOUT at the end of the list, which is the first sign of an issue; that entry should have been cleared (step 4). Step 5 likely falls out from whatever caused step 4 to print incorrectly.
Once I figure out what is broken, I'll see if it is also broken in LLVM.org LLDB svn trunk. It is possible this is just a manifestation of a broken merge. But it might also exist at LLVM.org.
I'm definitely getting unexpected behavior on a forced timeout test (packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py).
When run with:
LLDB_TEST_TIMEOUT=5s test/dotest.py --executable=`xcrun -find lldb` --test-subdir issue_verification
I'm seeing it oscillate between passing and timing out, where the timeout case is not even generating a timeout event. Tracking that down now.
This is looking like a race in test inferior status collection.
The test infrastructure communicates what the test runner's many test queues are doing primarily through events. When a test inferior completes, we do a wrap-up on the inferior's status to generate synthetic events, like "timeout", that for whatever reason cannot be generated by the inferior itself. This also covers meta events like job queue events and similar bits.
If I introduce a small delay in dosep.send_inferior_post_run_events(), all the issues I see go away. This implies a race in the test runner: the run strategy knows the inferior is done, but we haven't yet received the process end state, so we don't know that it timed out. We thus skip the timeout cleanup even though it was a timeout. Inserting a one-second delay there yields correct results 10 out of 10 times.
I'll need to diagnose the race and put in something to prevent it.
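The suspected race can be sketched in miniature. Everything below (class and function names) is illustrative, not the real runner code; the point is that the runner should wait for the inferior's end state rather than racing it with a fixed delay:

```python
import threading
import time

# Toy illustration of the race: the runner learns the worker is done
# before the worker's end state (e.g. "timed out") has been recorded.
# A condition variable lets the runner wait for the state instead of
# reading it too early and skipping the timeout cleanup.

class InferiorStatus:
    def __init__(self):
        self._cond = threading.Condition()
        self._end_state = None

    def record_end_state(self, state):
        with self._cond:
            self._end_state = state
            self._cond.notify_all()

    def wait_for_end_state(self, timeout=None):
        with self._cond:
            self._cond.wait_for(lambda: self._end_state is not None, timeout)
            return self._end_state

status = InferiorStatus()

def worker():
    time.sleep(0.05)  # simulate slow delivery of the end state
    status.record_end_state("timeout")

threading.Thread(target=worker).start()
# Without waiting, the runner could read None here and skip the
# timeout cleanup; waiting on the condition avoids the race.
print(status.wait_for_end_state(timeout=1.0))  # -> timeout
```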
I'm going to temporarily bump up the timeouts while we work this out, to avoid the timeouts in the first place. This is a reasonable short-term workaround while I investigate and address the underlying issue, provided the timeouts are not actually deadlocks; if they are deadlocks, no amount of extra time will let them pass in the first round. We should look into running 'sample' (OS X) or the Linux equivalent when we hit a test suite timeout, so we can see whether these timeouts are due to something more nefarious than just running longer under heavy load.
This is most definitely not only affecting timeout results. It also affects FAIL/ERROR and other categories, as witnessed here:
https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-14_04/2765
https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-15_10/3115
I'm following a lead that has potential. The test inferiors and the test inferior worker threads send test events through a socket whose listeners run as asyncore sockets. All the test events must relay through the listening socket in the main dosep.py process, which then forwards them to the final top-level ResultsFormatter. It appears that a rerun phase with a small number of reruns can now race and complete (and thus report to the ResultsFormatter) after the summarization step takes place, so those test events never get the opportunity to overwrite the old failing ones.
I'm investigating that lead and looking to see how I can ensure all events are fully drained before doing the summarization. If this ends up being the issue, the next question would be what is causing this to show up now?
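A minimal sketch of the "drain all events before summarizing" idea, using a plain queue and sentinel rather than the real asyncore plumbing; all names here are illustrative:

```python
import queue
import threading

# Instead of summarizing as soon as the workers look finished, keep
# pulling events until every producer has signaled completion, so no
# late rerun events arrive after the summary is printed.

events = queue.Queue()
DONE = object()  # sentinel a producer enqueues when it is finished

def producer(n):
    for i in range(n):
        events.put(("test-event", i))
    events.put(DONE)

def drain(expected_producers):
    results, done = [], 0
    while done < expected_producers:
        ev = events.get()  # blocks until every producer finishes
        if ev is DONE:
            done += 1
        else:
            results.append(ev)
    return results

t = threading.Thread(target=producer, args=(3,))
t.start()
collected = drain(expected_producers=1)  # summarize only after this returns
t.join()
print(len(collected))  # -> 3
```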
I think I've got at least one of the races understood now.
When we're in the rerun phase, we intentionally run only one test at a time. This is not specific to the rerun logic, but the rerun phase is where it is most likely to show up.
The run loop exits when there are no worker threads active and no receiving worker sockets (asyncore-based) or listener running; i.e. we're done when we think we're done. Except...
This is the sequence I think is failing (which I've been debugging):
1. Single worker queue running. (As is the case in the rerun logic).
2. A test inferior or worker queue queues up a connection and message to the listener socket.
3. The worker completes and goes away before the listener gets an asyncore cycle to process the awaiting connection from the test inferior or the worker queue's job-level messages.
4. Because the worker queue is done, the check that says "kill the listener if there are no more worker queues" fires. Now the listener socket never gets processed, and we exit the main loop without having received the test event data we really need.
This would explain what I'm seeing. I'm going to figure out how to solve that.
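Step 4 above amounts to a broken exit condition. This is a toy sketch of the shape of the bug and a safer check, not the real dosep.py logic:

```python
# Buggy exit check: tearing down the listener as soon as the worker
# queues are done loses any connection still waiting to be accepted.

def run_loop(workers_alive, pending_connections):
    # Ignores connections that arrived but were not yet processed.
    if not workers_alive():
        return "exit"  # listener killed; queued events are lost
    return "keep running"

def run_loop_fixed(workers_alive, pending_connections):
    # Safer: stay alive while anything is still queued for the listener.
    if not workers_alive() and not pending_connections():
        return "exit"
    return "keep running"

# Workers are done, but one connection is still pending:
print(run_loop(lambda: False, lambda: True))        # -> exit (the bug)
print(run_loop_fixed(lambda: False, lambda: True))  # -> keep running
```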
If this is indeed the issue, having the socket-initiating side wait for an ack from the read side before it does anything else should be sufficient. That ensures the producer of the test events (the test inferior or the worker thread) stays alive, keeping the run loop alive, until the listening dosep.py socket is registered in the asyncore IO map.
I'm going to give that a shot.
This doesn't look like a new problem in the code. It may have been hidden because we were always running slower; the new merge may have brought in enough speed improvements to change the probability of hitting the race.
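A minimal sketch of that ack handshake, using a plain TCP socket pair rather than the real asyncore listener; the one-byte ack payload and all names are illustrative:

```python
import socket
import threading

# The producer connects and then blocks until the listener acknowledges
# the connection. By the time the producer proceeds (and can exit), the
# listener has already registered the socket, so the event cannot be lost.

def listener(srv, out):
    conn, _ = srv.accept()
    conn.sendall(b"\x01")        # ack: the socket is now registered
    out.append(conn.recv(1024))  # then receive the test event
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
received = []
t = threading.Thread(target=listener, args=(srv, received))
t.start()

cli = socket.create_connection(srv.getsockname())
assert cli.recv(1) == b"\x01"    # wait for the ack before doing anything else
cli.sendall(b"TEST_EVENT")
cli.close()
t.join()
srv.close()
print(received[0].decode())  # -> TEST_EVENT
```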
My ack mechanism appears to be working fine. I'm going to get it into a PR and test it on the Linux builders; this kind of race is much more likely to occur on high-speed hardware. The fix will also need to go up to LLVM.org once I prove it solves the issue.
It's conceivable there's something else there lurking, but I'm hopeful this is the primary issue.
I put up this PR:
apple/swift-lldb#19
Ubuntu 14.04 passed without any issues showing up. We need a rerun to really test it, though.
Ubuntu 15.10 is just wrapping up and... it is a fail. Build #3130 on Ubuntu 15.10 (master) had a single ERROR, did the rerun, the rerun passed, but we still reported an ERROR for the summary and failed the test suite.
I'll need to repro the failing ERROR case. I did all my work with the timeouts, which definitely had a race which I patched. Looks like we have at least one more issue.
I'm able to repro a case of not all the failures clearing. It is with timeouts again, but it shows another error:
PYTHONUNBUFFERED=1 LLDB_TEST_TIMEOUT=3 /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c
Rerunning the following files:
lang/c/anonymous/TestAnonymous.py
lang/c/step-target/TestStepTarget.py
lang/c/function_types/TestFunctionTypes.py
lang/c/array_types/TestArrayTypes.py
lang/c/shared_lib_stripped_symbols/TestSharedLibStrippedSymbols.py
lang/c/bitfields/TestBitfields.py
lang/c/shared_lib/TestSharedLib.py
rerun will use the 'threading' test runner strategy
1 out of 7 test suites processed - TestSharedLibStrippedSymbols.py
2 out of 7 test suites processed - TestSharedLib.py
3 out of 7 test suites processed - TestBitfields.py
4 out of 7 test suites processed - TestArrayTypes.py
5 out of 7 test suites processed - TestFunctionTypes.py
Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Command invoked: /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestAnonymous.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int
Configuration: arch=x86_64 compiler=clang
Collected 14 tests
Cannot read termcap database;
using dumb terminal settings.
[TestAnonymous.py FAILED] (TIMEOUT)
Command invoked: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestAnonymous.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int
6 out of 7 test suites processed - TestAnonymous.py
Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Command invoked: /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestStepTarget.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int
Configuration: arch=x86_64 compiler=clang
Collected 10 tests
Cannot read termcap database;
using dumb terminal settings.
[TestStepTarget.py FAILED] (TIMEOUT)
Command invoked: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestStepTarget.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int
7 out of 7 test suites processed - TestStepTarget.py
Test rerun complete
=============
Issue Details
TIMEOUT: test_and_run_command_dwarf (lang/c/bitfields/TestBitfields.py)
TIMEOUT: test_and_run_command_dwarf (lang/c/array_types/TestArrayTypes.py)
TIMEOUT: test_expr_child_dwarf (lang/c/anonymous/TestAnonymous.py)
TIMEOUT: test_expr_grandchild_dwarf (lang/c/anonymous/TestAnonymous.py)
TIMEOUT: test_frame_variable_dwarf (lang/c/shared_lib_stripped_symbols/TestSharedLibStrippedSymbols.py)
TIMEOUT: test_frame_variable_dwarf (lang/c/shared_lib/TestSharedLib.py)
TIMEOUT: test_pointers_dsym (lang/c/function_types/TestFunctionTypes.py)
TIMEOUT: test_with_end_line_bad_name_dsym (lang/c/step-target/TestStepTarget.py)
TIMEOUT: test_with_end_line_bad_name_dwarf (lang/c/step-target/TestStepTarget.py)
===================
Test Result Summary
Test Methods: 71
Reruns: 21
Success: 49
Expected Failure: 13
Failure: 0
Error: 0
Exceptional Exit: 0
Unexpected Success: 0
Skip: 0
Timeout: 9
Expected Timeout: 0
Process finished with exit code 1
Only TestAnonymous.py and TestStepTarget.py still had failures; the rest came out clean, yet we still recorded them all as failures. Five files should have had their errors cleared, but that didn't happen.
Coming back to this now that I'm back.
Looks like I found the issue with timeouts. The timeout code generates filenames ending in .py, while the success path reports .pyc. I don't think the old path reported .pyc; it reported .py. Most likely something changed in how we retrieve the filename of the currently running test, and that broke the collation logic for test results.
Working on a fix to normalize filenames now.
I'm also going to have a look at what changed in filename retrieval, since we should be using the same underlying mechanism for both.
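The normalization could look something like this sketch; the function name and the .pyo case are my assumptions, not the committed fix:

```python
import os

# Collate results under the source (.py) name whether an event
# reported the module as .py or as compiled bytecode (.pyc/.pyo),
# so a passing rerun matches and clears the earlier failing entry.
def normalize_test_filename(path):
    base, ext = os.path.splitext(path)
    if ext.lower() in (".pyc", ".pyo"):
        return base + ".py"
    return path

print(normalize_test_filename("test/lang/c/TestAnonymous.pyc"))
# -> test/lang/c/TestAnonymous.py
```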
I tracked this down and committed the fix here:
apple/swift-lldb#20
Fixed here:
apple/swift-lldb@25c5f20