Giter Club home page Giter Club logo

Comments (14)

sodul avatar sodul commented on June 10, 2024 1

With the help of this post:https://stackoverflow.com/questions/15349997/assertionerror-when-threading-in-python

We found out that the library under test has a subclass of threading.Thread() but with a different signature. We suspect that the code need to be refactored to behave better. It looks like the bug is not in green, but the fact that green get stuck is concerning.

I suppose it got stuck because the code under test got stuck.

from green.

CleanCut avatar CleanCut commented on June 10, 2024 1

Fixed in Green 3.3.0, which I just released. 🥳 🎈 🎉

from green.

sodul avatar sodul commented on June 10, 2024

We got the issue 8 out of 10 times on our CI pipeline. By adding -s 1 to disable multithreading for this directory we passed 5 out of 5 times. So we have a temporary workaround.

from green.

CleanCut avatar CleanCut commented on June 10, 2024

Glad to hear the problem was somewhere else. 😉

from green.

sodul avatar sodul commented on June 10, 2024

@CleanCut unfortunately I combed over the library that is causing green to freeze and I could only find a small bit of code where we subclass threading.Thread so we can collect the return value of run(). We do not use threads anywhere else in that library and we are not using multiprocessing either.

I was not able to reproduce locally on macOS either natively or under the same docker container we use in our CI. It is very consistent on our CI system with python 3.9.6.

I see that you wrote green.process.LoggingDaemonlessPool._repopulate_pool() back in 2015, and that the official Pool class now has a _repopulate_pool_static() that was added 2y ago, bypassing your _repopulate_pool(). I believe that LoggingDaemonlessPool needs to be updated with its own _repopulate_pool_static().

https://github.com/python/cpython/pull/11627/files#diff-868ec74e6c7b13cb1ba8185bbdb6003a23671a5d2c7e1123de4ffb4250f6932bR269

That's my 1AM analysis and might be wrong but so far I really think the bug is in green.

from green.

sodul avatar sodul commented on June 10, 2024

I did some experiments and patching green/process.py with no success. I'll investigate further and will let you know.

from green.

CleanCut avatar CleanCut commented on June 10, 2024

I did some experiments and patching green/process.py with no success. I'll investigate further and will let you know.

👍🏻 I'll wait to hear what you find.

from green.

sodul avatar sodul commented on June 10, 2024

I've patched multiprocessing.process.BaseProcess to get more context information:

    def __init__(self, group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None):
        if group is not None:
            import inspect
            import sys
            import traceback

            print()
            print('current process:', current_process())
            print(f'sys.argv: {sys.argv}')
            print(
                f'group: {group!r}, target: {target}, name: {name}, args: {args}, kwargs: {kwargs}'
            )
            print('target signature: ', inspect.signature(target))
            traceback.print_stack()
            initializer_or_finalizer = args[2]
            print(
                'InitializerOrFinalizer module:',
                repr(initializer_or_finalizer.module_part),
                'function:',
                repr(initializer_or_finalizer.function_part),
            )
            print()
        assert group is None, 'group argument must be None for now'

This is what I get:

current process: <_MainProcess name='MainProcess' parent=None started>
sys.argv: ['/root/.pyenv/versions/3.9.6/lib/python3.9/site-packages/green/__main__.py']
group: <multiprocessing.context.ForkContext object at 0x7ffb5da34250>, target: <function worker at 0x7ffb5d7543a0>, name: None, args: (<multiprocessing.queues.SimpleQueue object at 0x7ffb23f6a1f0>, <multiprocessing.queues.SimpleQueue object at 0x7ffb254337c0>, <green.runner.InitializerOrFinalizer object at 0x7ffb585d75e0>, (), None, True), kwargs: {}
target signature:  (inqueue, outqueue, initializer=None, initargs=(), maxtasks=None, wrap_exception=False, finalizer=None, finalargs=())
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/threading.py", line 930, in _bootstrap
    self._bootstrap_inner()
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 513, in _handle_workers
    cls._maintain_pool(ctx, Process, processes, pool, inqueue,
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 337, in _maintain_pool
    Pool._repopulate_pool_static(ctx, Process, processes, pool,
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/pool.py", line 319, in _repopulate_pool_static
    w = Process(ctx, target=worker,
  File "/root/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/process.py", line 100, in __init__
    traceback.print_stack()
InitializerOrFinalizer module: '' function: ''

Somehow we end up in the init of BaseProcess with a group argument set to ForkContext which causes an assert to be raised.

I do not have deep experience with the multiprocessing library so my understanding is limited.

The other issue is that this has been very difficult to reproduce. It is very consistently failing under our CI pipeline conditions but changing some elements avoids the issue:

  • always works fine on my laptop, either native macOS or with the same container.
  • we use a python process wrapper (calls pylint, mypy, green ... through subprocess.run()) and calling green directly, the issue does not show up.
  • splitting the library which fails into smaller chunks and everything passes, so I was not able to identify if there is a 'conflict' with some of our code.

Let me know if the data I dumped helps you and if there are other things you would like me to dump. Unfortunately I cannot share a reproducible case since the docker image and our code is private.

from green.

sodul avatar sodul commented on June 10, 2024

I'll pause my investigation for a few days. I tried many different combinations but I am still unable to pinpoint any specific test or combination of tests that trigger the issue. It seems that the number of tests might be a factor since we have a few hundred tests in that directory.

The next thing I will try is to revert the two threading patches added in python 3.9.6 and see if they are the ones triggering the issue.

from green.

CleanCut avatar CleanCut commented on June 10, 2024

Good news! I was able to reproduce the problem! 🎉

That diff you linked to was super helpful. You were correct that the problem was because of the addition of _repopulate_pool_static, which broke repopulating pool subprocesses after they have died. Since green sets the processes to never die, this usually wouldn't be a problem, but somehow your workload is managing to kill all of the subprocesses -- and then since the pool has no workers pulling jobs from the queue, everything just hangs forever.

I was able to reproduce the problem in aa572e1 by adding a -X, --max--maxtasksperchild option and setting that to 1, and then trying to run more suites than subprocesses that get launched. I hit a hang in Python 3.8 and 3.9, but Python 3.7 works fine.

Now to fix the problem...

from green.

sodul avatar sodul commented on June 10, 2024

Thanks a lot for the quick fix!!! I'll try 3.3.0 in our CI pipeline tomorrow and will let you know how it goes.

from green.

CleanCut avatar CleanCut commented on June 10, 2024

🤞🏻

from green.

sodul avatar sodul commented on June 10, 2024

I've tested green 3.3.0 and I can confirm that the AssertionError stack trace is gone, however the tests are still stuck in parallel mode and never return. Passing the -s 1 option to green bypasses the problem so we are not stuck but there is still a problem somewhere.

I'll try to run some more experiments over the weekend. I'm wondering if adding a timeout per test might help me pinpoint which test might cause the situation to trigger.

The good new is that thanks to your new CLI option I was able to reproduce on my laptop under macOS which will help a lot.
green -r -v -s 32 --maxtasksperchild 1. This will allow me to cycle through experiments much faster.

Here is what I suspect might be happening. We run our CI under Kubernetes where the container is not getting the full CPU capacity of the host, but many processes in the container still believe the process has access to all the CPU/RAM resources. Without -s being set green is probably setting a number of cores (16) the host is claiming but the container is only allocated 4 k8s CPUs. Setting -s 4 seems to fix the issue as well and TBH doing more than 4 is way overkill so I'll set that as the default to pass to green from our wrapper.

I did confirm that os.cpu_count(), which Pool() calls when process is None, is returning 16 on our CI pipelines.

As far as I'm concern green is now even better today than it was last week :-)
Thanks a lot.

from green.

CleanCut avatar CleanCut commented on June 10, 2024

Here is what I suspect might be happening. We run our CI under Kubernetes where the container is not getting the full CPU capacity of the host, but many processes in the container still believe the process has access to all the CPU/RAM resources. Without -s being set green is probably setting a number of cores (16) the host is claiming but the container is only allocated 4 k8s CPUs. Setting -s 4 seems to fix the issue as well and TBH doing more than 4 is way overkill so I'll set that as the default to pass to green from our wrapper.

Oh, good. I'm glad you found a workaround. If we can find both a) a way to determine that we are running in k8s (an environment variable that k8s always sets, maybe?), AND b) a way for Python to determine a better CPU number in k8s, then we could have green adjust it's auto cpu count under k8s.

If we do go down this path, let's open a new issue to have the discussion in.

As far as I'm concern green is now even better today than it was last week :-)
Thanks a lot.

You are most welcome!

from green.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.