
Comments (15)

trfiala commented on July 21, 2024

I'm going to see if I can get some tests around the test infrastructure itself so we can guard against this feature breaking. Our CI depends heavily on it to avoid generating noise (i.e. false positives).

trfiala commented on July 21, 2024

This is one example of it:
https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-14_04/2753/

The symptom looks like this:

1. The test log will show the original failure (usually a timeout or a load-sensitive test failing, like TestConcurrentEvents.py or the lldb-mi ones).
2. The test will rerun in the trailing rerun phase.
3. The rerun will clearly pass.
4. The final summarization will continue to indicate there was a failure.
5. We will then fail the test run.

For this example, TestConcurrentEvents.py failed (step 1). It was rerun (step 2) and passed without any further timeout (step 3). The summarization still lists the TIMEOUT at the end of the list; that is the first sign of an issue, since the rerun should have cleared it (step 4). Step 5 likely falls out of whatever caused step 4 to print incorrectly.
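
For clarity, the intended collation semantics are that the most recent result for a given test method wins, so a clean rerun clears the earlier issue before the summary is printed. A toy sketch of that expectation (the names here are illustrative, not the real ResultsFormatter internals):

```python
# Toy model of the intended rerun semantics: the latest event recorded for a
# given (file, test method) pair wins, so a passing rerun should clear an
# earlier TIMEOUT/FAIL before the summary is generated.
results = {}

def record(test_file, test_method, status):
    results[(test_file, test_method)] = status

record("TestConcurrentEvents.py", "test_example_dwarf", "TIMEOUT")  # first pass
record("TestConcurrentEvents.py", "test_example_dwarf", "PASS")     # trailing rerun
assert all(status == "PASS" for status in results.values())  # summary should be clean
```

The bug being chased here is precisely that the second record either never lands or lands under a different key, so the summary still shows the first one.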

Once I figure out what is broken, I'll see if it is also broken in LLVM.org LLDB svn trunk. It is possible this is just a manifestation of a broken merge. But it might also exist at LLVM.org.

trfiala commented on July 21, 2024

I'm definitely getting unexpected behavior on a forced timeout test (packages/Python/lldbsuite/test/issue_verification/TestRerunTimeout.py).

When run with:
LLDB_TEST_TIMEOUT=5s test/dotest.py --executable=`xcrun -find lldb` --test-subdir issue_verification

I'm seeing it oscillate between passing and timing out, where the timeout case is not even generating a timeout event. Tracking that down now.
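
For context, a forced-timeout test of this kind only needs to sleep past LLDB_TEST_TIMEOUT; a minimal sketch in the usual lldbsuite style (this is illustrative, not the actual contents of TestRerunTimeout.py):

```python
"""Deliberately exceed LLDB_TEST_TIMEOUT so the rerun machinery has a timeout to clear."""
import time

from lldbsuite.test.lldbtest import TestBase


class ForcedTimeoutTestCase(TestBase):
    mydir = TestBase.compute_mydir(__file__)

    def test_forced_timeout(self):
        # Sleep well past the configured per-test timeout (LLDB_TEST_TIMEOUT=5s above),
        # so the parallel runner marks this inferior as a TIMEOUT on the first pass.
        time.sleep(120)
```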

trfiala commented on July 21, 2024

This is looking like a race in test inferior status collection.

The test infrastructure uses events as the primary means of communicating what the test runner's many test queues are doing. When a test inferior completes, we wrap up the inferior's status and generate synthetic events, like "timeout", that for whatever reason cannot be generated by the inferior itself. This also covers meta events such as job-queue events and similar bits.

If I introduce a small delay in dosep.send_inferior_post_run_events(), all the issues I see go away. This implies a race in the test runner: the run strategy knows the inferior is done, but we haven't yet received the process end state, so we don't know that it timed out and we skip the timeout cleanup even though a timeout occurred. Inserting a one-second delay there yields correct results 10 out of 10 times.
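
The debugging aid was nothing fancier than a sleep at the top of that function; roughly (the helper below is just how I'd express it, not the actual dosep.py change):

```python
import time

def _debug_post_run_delay(seconds=1.0):
    """Temporary diagnostic: give the runner time to observe the inferior's end
    state (e.g. the timeout) before the synthetic post-run events are generated.
    Called at the top of dosep.send_inferior_post_run_events() while chasing the race."""
    time.sleep(seconds)
```

That a sleep changes the outcome at all is the evidence for the race; the real fix has to remove the ordering assumption rather than paper over it with a delay.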

I'll need to diagnose the race and put in something to prevent it.

trfiala commented on July 21, 2024

I'm going to temporarily bump up the timeouts while we work this out, to avoid hitting them in the first place. That is a reasonable short-term work-around while I investigate and address the underlying issue, provided the timeouts are not actually deadlocks; if they are deadlocks, no amount of extra time will help them pass in the first round. We should look into running 'sample' (OS X), or the Linux equivalent, when we hit a test suite timeout so we can see whether these timeouts are due to something more nefarious than just taking longer under heavy load.
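
A sketch of what that stack-snapshot hook could look like (the hook point in the runner is an assumption; 'sample' is the stock OS X tool, and a batch gdb attach stands in for the Linux side):

```python
import platform
import subprocess

def dump_stacks_on_timeout(pid, duration_secs=3):
    """Best-effort stack snapshot of a timed-out test inferior, so we can tell
    a genuine deadlock from a test that is merely slow under load."""
    if platform.system() == "Darwin":
        # 'sample <pid> <seconds>' profiles the process and prints its call stacks.
        cmd = ["sample", str(pid), str(duration_secs)]
    else:
        # One Linux stand-in: a batch gdb attach that prints all thread backtraces.
        cmd = ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError) as error:
        return "stack dump failed: {}".format(error)
```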

trfiala commented on July 21, 2024

This is most definitely not only affecting timeout results. It also affects FAIL/ERROR and other categories, as witnessed here:

https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-14_04/2765
https://ci.swift.org/job/oss-lldb-incremental-linux-ubuntu-15_10/3115

I'm following a promising lead. The test inferiors and the worker threads send test events over sockets whose listeners run as asyncore sockets. All test events relay through the listening socket in the main dosep.py process, which forwards them to the final top-level ResultsFormatter. It appears that a rerun phase with a small number of reruns can now race and report to the ResultsFormatter only after the summarization step has already taken place, so those test events never get the opportunity to overwrite the old failing ones.

I'm investigating that lead and looking to see how I can ensure all events are fully drained before doing the summarization. If this ends up being the issue, the next question would be what is causing this to show up now?
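
If that lead pans out, one shape the guard could take is a drain loop that refuses to summarize until the workers are gone and the asyncore map is empty (the worker-thread bookkeeping below is an assumption; asyncore.socket_map and asyncore.loop are the stdlib pieces the runner already relies on):

```python
import asyncore
import time

def drain_test_events(worker_threads, deadline_secs=10.0):
    """Keep servicing the asyncore listener until every worker is gone and no
    event sockets remain, so late rerun results reach the ResultsFormatter
    before summarization runs."""
    deadline = time.time() + deadline_secs
    while time.time() < deadline:
        workers_alive = any(t.is_alive() for t in worker_threads)
        if not workers_alive and not asyncore.socket_map:
            return True  # all events relayed; safe to summarize
        if asyncore.socket_map:
            # Service one round of pending socket events, then re-check.
            asyncore.loop(timeout=0.1, count=1)
        else:
            # Workers still running but nothing queued yet; back off briefly.
            time.sleep(0.1)
    return False  # gave up waiting; summarization may miss late events
```

The deadline is just a safety valve so a wedged inferior can't hang the whole run at the summarization step.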

trfiala commented on July 21, 2024

I think I've got at least one of the races understood now.

When we're in the rerun phase, we intentionally run only one test at a time. The race is not specific to the rerun logic, but the rerun phase is where it is most likely to show up.

The run loop exits when there are no active worker threads and neither the receiving worker sockets (asyncore-based) nor the listener is running; i.e., we're done when we think we're done. Except...

This is the sequence I think is failing (which I've been debugging):

1. A single worker queue is running (as is the case in the rerun logic).
2. A test inferior or worker queue queues up a connection and message to the listener socket.
3. The worker completes and goes away before the listener gets an asyncore cycle to process the pending connection from the test inferior or the worker queue's job-level messages.
4. Because the worker queue is done, the check that says "kill the listener if there are no more worker queues" fires. The listener socket never gets serviced, and we exit the main loop without having received the test event data we actually need.

This would explain what I'm seeing. I'm going to figure out how to solve that.

trfiala commented on July 21, 2024

If this is indeed the issue, having the socket-initiating side wait for an ack on the read side before it does anything else should be sufficient. That ensures the producer of the test events (the test inferior or the worker thread) stays alive, and therefore keeps the run loop alive, until the listening dosep.py socket has made it into the asyncore IO map.
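
In outline, the producer side of that handshake looks something like this (the single ack byte and the helper name are mine, not the actual patch; the point is just that the sender does not proceed until the listener has demonstrably accepted the connection):

```python
import socket

ACK_BYTE = b"*"  # arbitrary one-byte handshake; the real patch may use something else

def connect_with_ack(host, port, timeout_secs=30.0):
    """Producer side (test inferior or worker thread): block until the dosep.py
    listener has accepted the connection and acknowledged it, which means the
    connection is in the listener's asyncore map and the run loop stays alive."""
    sock = socket.create_connection((host, port), timeout_secs)
    ack = sock.recv(1)
    if ack != ACK_BYTE:
        raise RuntimeError("event listener did not acknowledge the connection")
    return sock

# On the listener side, the asyncore handle_accept() for the event channel would
# send ACK_BYTE back immediately after accepting, before reading any events.
```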

I'm going to give that a shot.

This doesn't look like a new problem in the code. It might have been hidden because we were always running slower, but the new merge may have brought in enough speed improvements to change the probability of the race being hit.

trfiala commented on July 21, 2024

My ack mechanism appears to be working fine. I'm going to get it into a PR and test it on the Linux builders; the nature of the race makes it much more likely to occur on high-speed hardware. This will also need to go up to LLVM.org once I've proven it solves the issue.

It's conceivable there's something else there lurking, but I'm hopeful this is the primary issue.

trfiala commented on July 21, 2024

I put up this PR:
apple/swift-lldb#19

Ubuntu 14.04 passed without any issues showing up. We need a rerun to really test it, though.

Ubuntu 15.10 is just wrapping up and... it is a fail. Build #3130 on Ubuntu 15.10 (master) had a single ERROR, did the rerun, the rerun passed, but we still reported an ERROR for the summary and failed the test suite.

I'll need to repro the failing ERROR case. I did all my work with the timeouts, which definitely had a race, and I patched that. It looks like we have at least one more issue.

trfiala commented on July 21, 2024

I'm able to repro a case where not all of the failures clear. It involves timeouts again, but it exposes another bug:

PYTHONUNBUFFERED=1 LLDB_TEST_TIMEOUT=3 /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c

Rerunning the following files:
lang/c/anonymous/TestAnonymous.py
lang/c/step-target/TestStepTarget.py
lang/c/function_types/TestFunctionTypes.py
lang/c/array_types/TestArrayTypes.py
lang/c/shared_lib_stripped_symbols/TestSharedLibStrippedSymbols.py
lang/c/bitfields/TestBitfields.py
lang/c/shared_lib/TestSharedLib.py
rerun will use the 'threading' test runner strategy

1 out of 7 test suites processed - TestSharedLibStrippedSymbols.py
2 out of 7 test suites processed - TestSharedLib.py
3 out of 7 test suites processed - TestBitfields.py
4 out of 7 test suites processed - TestArrayTypes.py
5 out of 7 test suites processed - TestFunctionTypes.py

Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Command invoked: /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestAnonymous.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int

Configuration: arch=x86_64 compiler=clang

Collected 14 tests

Cannot read termcap database;
using dumb terminal settings.

[TestAnonymous.py FAILED] (TIMEOUT)
Command invoked: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestAnonymous.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int

6 out of 7 test suites processed - TestAnonymous.py

Session logs for test failures/errors/unexpected successes will go into directory '2016-04-14-00_21_11'
Command invoked: /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestStepTarget.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int

Configuration: arch=x86_64 compiler=clang

Collected 10 tests

Cannot read termcap database;
using dumb terminal settings.

[TestStepTarget.py FAILED] (TIMEOUT)
Command invoked: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python /Users/tfiala/src/lldb-github/lldb/test/dotest.py --rerun-all-issues --executable=/Users/tfiala/src/lldb-github/lldb/build/Release/lldb --test-subdir lang/c -s 2016-04-14-00_21_11 --results-port 56586 --inferior -p TestStepTarget.py /Users/tfiala/src/lldb-github/lldb/packages/Python/lldbsuite/test/lang/c --event-add-entries worker_index=0:int

7 out of 7 test suites processed - TestStepTarget.py
Test rerun complete

=============
Issue Details

TIMEOUT: test_and_run_command_dwarf (lang/c/bitfields/TestBitfields.py)
TIMEOUT: test_and_run_command_dwarf (lang/c/array_types/TestArrayTypes.py)
TIMEOUT: test_expr_child_dwarf (lang/c/anonymous/TestAnonymous.py)
TIMEOUT: test_expr_grandchild_dwarf (lang/c/anonymous/TestAnonymous.py)
TIMEOUT: test_frame_variable_dwarf (lang/c/shared_lib_stripped_symbols/TestSharedLibStrippedSymbols.py)
TIMEOUT: test_frame_variable_dwarf (lang/c/shared_lib/TestSharedLib.py)
TIMEOUT: test_pointers_dsym (lang/c/function_types/TestFunctionTypes.py)
TIMEOUT: test_with_end_line_bad_name_dsym (lang/c/step-target/TestStepTarget.py)
TIMEOUT: test_with_end_line_bad_name_dwarf (lang/c/step-target/TestStepTarget.py)

===================
Test Result Summary

Test Methods: 71
Reruns: 21
Success: 49
Expected Failure: 13
Failure: 0
Error: 0
Exceptional Exit: 0
Unexpected Success: 0
Skip: 0
Timeout: 9
Expected Timeout: 0

Process finished with exit code 1

Only TestAnonymous.py and TestStepTarget.py still had failures on the rerun, while the rest came out clean, yet all of them are still recorded as issues. It looks like five of the files should have had their timeouts cleared, but that didn't happen.

trfiala commented on July 21, 2024

Picking this up again now that I'm back.

trfiala commented on July 21, 2024

Looks like I found the issue with timeouts. The timeout code generates filenames ending in .py, while the success path reports .pyc. I don't think the old path reported .pyc; it reported .py. That is probably where this issue came from: something else changed how we retrieve the filename for the currently running test, and that broke the collation logic for test results.

Working on a fix to normalize filenames now.

I'm also going to have a look at what changed in filename retrieval, since we should be using the same underlying mechanism for both.
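
The normalization itself is simple; a sketch of the idea (the helper name is mine, and the real fix lives wherever the result-collation key is built):

```python
import os

def normalize_test_filename(path):
    """Collate results under the source file name: a timeout reported against
    'TestFoo.py' and a success reported against 'TestFoo.pyc' must map to the
    same key, or the rerun's clean result never clears the original timeout."""
    base, ext = os.path.splitext(path)
    if ext in (".pyc", ".pyo"):
        ext = ".py"
    return base + ext

assert normalize_test_filename("lang/c/anonymous/TestAnonymous.pyc") == \
    "lang/c/anonymous/TestAnonymous.py"
```

With both paths funneled through the same normalization, a rerun's clean result lands on the same key as the original timeout and clears it.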

trfiala commented on July 21, 2024

I tracked this down and committed the fix here:
apple/swift-lldb#20

trfiala commented on July 21, 2024

Fixed here:
apple/swift-lldb@25c5f20
