Comments (10)

Fluff663 commented on July 21, 2024

Note: this is running under WSL2

from pokemonshowdown-ai.

taylorhansen commented on July 21, 2024

Hi, thanks for trying this out!

I've occasionally seen this on my Linux machine: for some reason the IPC channels don't connect properly and a timeout error like this occurs. I think previous runs of the training script (errored or Ctrl-C'd) sometimes leave dangling subprocesses that interfere with the IPC channels the next run uses to connect to its own subprocesses.

Restarting, running rm /tmp/psai-*, or killing the dangling processes should clear the IPC channels and let the training script run properly again.
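As a rough illustration of the rm /tmp/psai-* step, here is a minimal Python sketch. The function name remove_stale_ipc_files and its dry_run flag are hypothetical and not part of the project; the only detail taken from this thread is the /tmp/psai-* pattern. It defaults to a dry run, so nothing is deleted unless you ask for it.

```python
import glob
import os


def remove_stale_ipc_files(pattern="/tmp/psai-*", dry_run=True):
    """Return the paths matching `pattern`; delete them unless dry_run is True.

    Hypothetical helper: only the /tmp/psai-* pattern comes from the thread.
    """
    matches = sorted(glob.glob(pattern))
    if dry_run:
        return matches
    for path in matches:
        try:
            # os.remove unlinks regular files and socket files alike.
            os.remove(path)
        except OSError as exc:
            # Directories or files owned by another user are skipped.
            print(f"skipped {path}: {exc}")
    return matches


if __name__ == "__main__":
    # Dry run: only list what would be removed.
    for stale_path in remove_stale_ipc_files():
        print("would remove", stale_path)
```

Only run it with dry_run=False when no training script is active, since it would also remove the files a live run is using.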

Hope this helps. Otherwise, I can try running this on my Windows machine to reproduce the error and come up with a better fix.

Fluff663 commented on July 21, 2024

I was also able to reproduce the error under an Ubuntu live USB, but I will test that when I get home.

Fluff663 commented on July 21, 2024

python -m src.py.train
2023-09-05 11:20:10.908271: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Configured to write training logs to /home/fluff663/pokemonshowdown-ai/experiments/train
2023-09-05 11:20:12.744648: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Configured to write TensorBoard metrics to /home/fluff663/pokemonshowdown-ai/experiments/train/metrics
Configured to write checkpoints to /home/fluff663/pokemonshowdown-ai/experiments/train/checkpoints
0: Eval: 0%| | 0.00/400 [00:24<?, ?battles/s]
Episode: 0%| | 0.00/10.0k [00:24<?, ?eps/s]
Task exception was never retrieved
future: <Task finished name='await_worker_3_battle_8' coro=<BattlePool.await_battle() done, defined at /home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py:249> exception=RuntimeError("Error in battle 'BattleKey(worker_id=b'worker_3', battle_id='battle_8')': AbortError: The operation was aborted\n at Object.destroyer (node:internal/streams/destroy:307:11)\n at createAsyncIterator (node:internal/streams/readable:1141:19)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)")>
Traceback (most recent call last):
File "/home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py", line 263, in await_battle
raise RuntimeError(
RuntimeError: Error in battle 'BattleKey(worker_id=b'worker_3', battle_id='battle_8')': AbortError: The operation was aborted
at Object.destroyer (node:internal/streams/destroy:307:11)
at createAsyncIterator (node:internal/streams/readable:1141:19)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
Task exception was never retrieved
future: <Task finished name='await_worker_1_battle_6' coro=<BattlePool.await_battle() done, defined at /home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py:249> exception=RuntimeError("Error in battle 'BattleKey(worker_id=b'worker_1', battle_id='battle_6')': AbortError: The operation was aborted\n at Object.destroyer (node:internal/streams/destroy:307:11)\n at createAsyncIterator (node:internal/streams/readable:1141:19)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)")>
Traceback (most recent call last):
File "/home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py", line 263, in await_battle
raise RuntimeError(
RuntimeError: Error in battle 'BattleKey(worker_id=b'worker_1', battle_id='battle_6')': AbortError: The operation was aborted
at Object.destroyer (node:internal/streams/destroy:307:11)
at createAsyncIterator (node:internal/streams/readable:1141:19)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
Traceback (most recent call last):
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/fluff663/pokemonshowdown-ai/src/py/train.py", line 385, in <module>
main()
File "/home/fluff663/pokemonshowdown-ai/src/py/train.py", line 381, in main
asyncio.run(train(config=config))
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/home/fluff663/pokemonshowdown-ai/src/py/train.py", line 259, in train
await run_eval(
File "/home/fluff663/pokemonshowdown-ai/src/py/train.py", line 106, in run_eval
state, _, terminated, truncated, info, done = await env.step(action)
File "/home/fluff663/pokemonshowdown-ai/src/py/environments/battle_env.py", line 342, in step
battle_result = await asyncio.wait_for(
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/asyncio/tasks.py", line 479, in wait_for
return fut.result()
File "/home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py", line 263, in await_battle
raise RuntimeError(
RuntimeError: Error in battle 'BattleKey(worker_id=b'worker_3', battle_id='battle_4')': AbortError: The operation was aborted
at Object.destroyer (node:internal/streams/destroy:307:11)
at createAsyncIterator (node:internal/streams/readable:1141:19)
at processTicksAndRejections (node:internal/process/task_queues:95:5)

Fluff663 commented on July 21, 2024

I realized I forgot part of the log.

Fluff663 commented on July 21, 2024

The issue persists.

Fluff663 commented on July 21, 2024

I'm going to try a different system.

taylorhansen commented on July 21, 2024

Taking another look at this, the "Cannot dlopen some GPU libraries" warning sounds pretty serious if you intended to use a GPU, and might indicate something wasn't set up correctly. Does TensorFlow on its own work normally on your computer/GPU?
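One quick way to check this outside the training script is to ask TensorFlow which GPUs it can see. The sketch below is only an illustration: the helper name visible_gpus is made up, while tf.config.list_physical_devices is the standard TensorFlow 2 call. An empty list would be consistent with the "Skipping registering GPU devices..." line in the log.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing.

    Hypothetical helper, not part of pokemonshowdown-ai.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Each entry is a PhysicalDevice with a name like "/physical_device:GPU:0".
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif not gpus:
        print("TensorFlow is installed but sees no GPU")
    else:
        print("TensorFlow sees:", gpus)
```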

Hope that helps.

taylorhansen commented on July 21, 2024

Hi again,

Can I ask which Node.js version you were using? When I upgraded my system from Node v16 to v18, I started getting the same AbortErrors you mentioned earlier.

I'm currently pushing out a fix that seems to work on my machine. Let me know if it fixes your issue.
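For anyone wanting to check which Node.js major version is on their PATH before reporting back, here is a small sketch. The function names parse_node_major and installed_node_major are hypothetical; the only external command used is node --version, which prints a string such as v18.17.1.

```python
import re
import shutil
import subprocess


def parse_node_major(version_text):
    """Extract the major version from `node --version` output, or None if unparseable."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    return int(match.group(1)) if match else None


def installed_node_major():
    """Return the major version of the `node` found on PATH, or None if absent."""
    node = shutil.which("node")
    if node is None:
        return None
    result = subprocess.run(
        [node, "--version"], capture_output=True, text=True, check=False
    )
    return parse_node_major(result.stdout)


if __name__ == "__main__":
    print("Node.js major version:", installed_node_major())
```

Based on this thread alone, a result of 18 matches the setup where the AbortErrors started appearing and 16 the one where they did not; other versions are not covered here.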
