Comments (10)
Note: this is running under WSL2
from pokemonshowdown-ai.
Hi, thanks for trying this out!
I've occasionally seen this issue on my Linux machine: for some reason the IPC channels don't connect properly and a timeout error like this occurs. I think previous runs of the training script (errored or Ctrl-C'd) sometimes leave dangling subprocesses that interfere with the IPC channels the next run uses to connect to its own subprocesses.
Restarting, running rm /tmp/psai-*, or using kill on the dangling processes should clear the IPC channels and let the training script run properly again.
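As a rough sketch of the cleanup step above, the following Python helper removes leftover files matching the /tmp/psai-* pattern mentioned in this comment. The function name `remove_stale_ipc_files` is hypothetical and not part of the project, and the sketch assumes the leftovers are plain files or socket files that the current user is allowed to delete. It should only be run while no training run is active.

```python
import glob
import os


def remove_stale_ipc_files(pattern: str = "/tmp/psai-*") -> list:
    """Delete leftover files matching `pattern` and return the removed paths.

    Hypothetical helper: the default pattern comes from the comment above;
    everything else is an assumption for illustration.
    """
    removed = []
    for path in sorted(glob.glob(pattern)):
        try:
            # os.remove unlinks regular files and Unix socket files alike.
            os.remove(path)
        except OSError:
            # Skip directories and files we are not permitted to delete.
            continue
        removed.append(path)
    return removed


# Example usage (uncomment to run against the default pattern):
# print(remove_stale_ipc_files())
```

This does not terminate the dangling subprocesses themselves; those still need to be found and stopped separately, for example with kill.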
Hope this helps. Otherwise, I can look into running this on my Windows machine to try to reproduce the error and come up with a better fix.
I was also able to reproduce the error under an Ubuntu live USB, but I will test that when I get home.
python -m src.py.train
2023-09-05 11:20:10.908271: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Configured to write training logs to /home/fluff663/pokemonshowdown-ai/experiments/train
2023-09-05 11:20:12.744648: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Configured to write TensorBoard metrics to /home/fluff663/pokemonshowdown-ai/experiments/train/metrics
Configured to write checkpoints to /home/fluff663/pokemonshowdown-ai/experiments/train/checkpoints
0: Eval: 0%| | 0.00/400 [00:24<?, ?battles/s]
Episode: 0%| | 0.00/10.0k [00:24<?, ?eps/s]
Task exception was never retrieved
future: <Task finished name='await_worker_3_battle_8' coro=<BattlePool.await_battle() done, defined at /home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py:249> exception=RuntimeError("Error in battle 'BattleKey(worker_id=b'worker_3', battle_id='battle_8')': AbortError: The operation was aborted\n at Object.destroyer (node:internal/streams/destroy:307:11)\n at createAsyncIterator (node:internal/streams/readable:1141:19)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)")>
Traceback (most recent call last):
File "/home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py", line 263, in await_battle
raise RuntimeError(
RuntimeError: Error in battle 'BattleKey(worker_id=b'worker_3', battle_id='battle_8')': AbortError: The operation was aborted
at Object.destroyer (node:internal/streams/destroy:307:11)
at createAsyncIterator (node:internal/streams/readable:1141:19)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
Task exception was never retrieved
future: <Task finished name='await_worker_1_battle_6' coro=<BattlePool.await_battle() done, defined at /home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py:249> exception=RuntimeError("Error in battle 'BattleKey(worker_id=b'worker_1', battle_id='battle_6')': AbortError: The operation was aborted\n at Object.destroyer (node:internal/streams/destroy:307:11)\n at createAsyncIterator (node:internal/streams/readable:1141:19)\n at processTicksAndRejections (node:internal/process/task_queues:95:5)")>
Traceback (most recent call last):
File "/home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py", line 263, in await_battle
raise RuntimeError(
RuntimeError: Error in battle 'BattleKey(worker_id=b'worker_1', battle_id='battle_6')': AbortError: The operation was aborted
at Object.destroyer (node:internal/streams/destroy:307:11)
at createAsyncIterator (node:internal/streams/readable:1141:19)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
Traceback (most recent call last):
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/fluff663/pokemonshowdown-ai/src/py/train.py", line 385, in <module>
main()
File "/home/fluff663/pokemonshowdown-ai/src/py/train.py", line 381, in main
asyncio.run(train(config=config))
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/home/fluff663/pokemonshowdown-ai/src/py/train.py", line 259, in train
await run_eval(
File "/home/fluff663/pokemonshowdown-ai/src/py/train.py", line 106, in run_eval
state, _, terminated, truncated, info, done = await env.step(action)
File "/home/fluff663/pokemonshowdown-ai/src/py/environments/battle_env.py", line 342, in step
battle_result = await asyncio.wait_for(
File "/home/fluff663/miniconda3/envs/psai/lib/python3.9/asyncio/tasks.py", line 479, in wait_for
return fut.result()
File "/home/fluff663/pokemonshowdown-ai/src/py/environments/utils/battle_pool.py", line 263, in await_battle
raise RuntimeError(
RuntimeError: Error in battle 'BattleKey(worker_id=b'worker_3', battle_id='battle_4')': AbortError: The operation was aborted
at Object.destroyer (node:internal/streams/destroy:307:11)
at createAsyncIterator (node:internal/streams/readable:1141:19)
at processTicksAndRejections (node:internal/process/task_queues:95:5)
I realized I forgot part of the log.
The issue persists.
I'm going to try a different system.
Taking another look at this, that "Cannot dlopen some GPU libraries" warning sounds pretty serious if you intended to use a GPU, and might indicate something wasn't set up correctly. Does TensorFlow on its own work normally on your computer/GPU?
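One quick way to check this outside the training script is to ask TensorFlow which GPUs it can see. The sketch below uses `tf.config.list_physical_devices`; the wrapper name `visible_gpus` is hypothetical, and an empty list would be consistent with the "Skipping registering GPU devices..." line in the log above.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


print("GPUs visible to TensorFlow:", visible_gpus())
```

If this prints an empty list on a machine with a GPU, the problem is likely in the driver or CUDA library setup rather than in the training script.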
Hope that helps.
Hi again,
Can I ask which Node.js version you were using? When I upgraded my system from Node v16 to v18, I started getting the same AbortErrors you mentioned earlier.
Currently pushing out a fix that seems to work on my machine. Lemme know if this fixes your issue.
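For anyone wanting to check their Node.js version from Python, here is a minimal sketch that runs `node --version` and extracts the major version. Both function names are hypothetical and not part of the project, and the sketch assumes `node` is on the PATH and prints a version string such as v18.17.1.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_node_major(version_text: str) -> Optional[int]:
    """Extract the major version number from text such as 'v18.17.1'."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    return int(match.group(1)) if match else None


def installed_node_major() -> Optional[int]:
    """Return the major version of the `node` found on PATH, or None if unavailable."""
    node = shutil.which("node")
    if node is None:
        return None
    result = subprocess.run(
        [node, "--version"], capture_output=True, text=True, check=False
    )
    return parse_node_major(result.stdout)


print("Node.js major version:", installed_node_major())
```

Including that number in a bug report makes it easier to tell whether the AbortErrors line up with the v16 to v18 upgrade described above.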