worldmodels's Issues

About the new version

Hi, thanks for your work. I noticed that some files were updated, such as 03_generate_rnn_data.py and 04_train_rnn.py. Is this because the previous versions had an error? (I had already used the non-updated files to train my agent.) Reading the updated code, it looks almost the same except that the RNN can now also predict the reward. Is that right?

how does Keras handle loss in a batch?

Hi, thanks for the implementation!

I have a question related to how Keras handles the loss in a batch. I noticed that the loss in the RNN is defined as:

result = K.mean(result, axis = (1,2)) # mean over rollout length and z dim

That means that you end up with one loss value per sample in the batch. I am not very familiar with Keras, so I wonder what happens to these losses when you call

rnn.fit(rnn_input, rnn_output,
        shuffle=True,
        epochs=mdn_epochs,
        batch_size=batch_size,
        validation_split=0.2,
        callbacks=callbacks_list)

Does Keras average over the batch? Could you point me to some documentation about it?

Thanks a lot!
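
For what it's worth, here is a minimal sketch (not from this repo, just a self-contained check of standard Keras behaviour): with no sample weights, Keras reduces the per-sample values returned by a custom loss to their mean over the batch, so the scalar it reports is that mean.

import numpy as np
from keras import backend as K
from keras.layers import Input, Lambda
from keras.models import Model

def per_sample_loss(y_true, y_pred):
    # returns one loss value per sample, like the RNN loss above
    return K.mean(K.square(y_true - y_pred), axis=-1)

inp = Input(shape=(4,))
out = Lambda(lambda x: x)(inp)   # identity model, prediction == input
model = Model(inp, out)
model.compile(optimizer='sgd', loss=per_sample_loss)

x = np.random.rand(8, 4).astype('float32')
y = np.zeros((8, 4), dtype='float32')

reported = model.test_on_batch(x, y)               # scalar loss Keras reports
manual = np.mean(np.mean((x - y) ** 2, axis=-1))   # mean of the per-sample losses
print(reported, manual)                            # should agree up to float error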

VAE reconstruction loss

Wow! Cool to see a reproduction so fast! Thanks a lot!

Shouldn't the reconstruction loss:

def vae_r_loss(y_true, y_pred):
    return K.sum(K.square(y_true - y_pred), axis = [1,2,3])
be based on the mean rather than the sum?

If it's sum-based, then its value depends on the size of the image. It also throws off the balance between the KL and reconstruction losses: the larger the image, the more the reconstruction loss overwhelms the optimisation.
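
For illustration, a mean-based variant could look like the sketch below (my rewrite, not the repo's code); the constant used to weight it against the KL term would then need re-tuning.

from keras import backend as K

def vae_r_loss_mean(y_true, y_pred):
    # mean over height, width and channels, so the value no longer
    # scales with the image size
    return K.mean(K.square(y_true - y_pred), axis=[1, 2, 3])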

Error training controller

/home/boucher/.virtualenvs/worldmodels/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
['mpirun', '-np', '2', '/home/boucher/.virtualenvs/worldmodels/bin/python', '05_train_controller.py', 'car_racing', '--num_worker', '1', '--num_worker_trial', '2', '--num_episode', '16', '--max_length', '1000', '--eval_steps', '25']
/home/boucher/.virtualenvs/worldmodels/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/home/boucher/.virtualenvs/worldmodels/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Using TensorFlow backend.
assigning the rank and nworkers 1 0
assigning the rank and nworkers 1 0
2018-04-18 18:30:08.922363: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-18 18:30:08.927218: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-18 18:30:09.084787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-18 18:30:09.085010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 9.47GiB
2018-04-18 18:30:09.085020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-18 18:30:09.085433: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-18 18:30:09.085774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 9.47GiB
2018-04-18 18:30:09.085783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-18 18:30:09.224535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-18 18:30:09.224562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-04-18 18:30:09.224566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-04-18 18:30:09.224701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9110 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-04-18 18:30:09.228563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-18 18:30:09.228578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-04-18 18:30:09.228597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-04-18 18:30:09.229018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 253 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
size of model 867
size of model 867
(1,2mirr1)-aCMA-ES (mu_w=1.0,w_1=100%) in dimension 867 (seed=205998, Wed Apr 18 18:30:09 2018)
('process', 0, 'out of total ', 1, 'started')
('training', 'car_racing')
('population', 2)
('num_worker', 1)
('num_worker_trial', 2)
('num_episode', 16)
('max_length', 1000)
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError
(1,2mirr1)-aCMA-ES (mu_w=1.0,w_1=100%) in dimension 867 (seed=250130, Wed Apr 18 18:30:09 2018)
('process', 0, 'out of total ', 1, 'started')
('training', 'car_racing')
('population', 2)
('num_worker', 1)
('num_worker_trial', 2)
('num_episode', 16)
('max_length', 1000)
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError
Traceback (most recent call last):
  File "05_train_controller.py", line 460, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "05_train_controller.py", line 429, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '2', '/home/boucher/.virtualenvs/worldmodels/bin/python', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '1', '--num_worker_trial', '2', '--num_episode', '16', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1

I'm not exactly sure what the problem is here. I had to lower the number of workers to run on my GPU without running out of memory, but playing around with that number and num_worker_trial, I wasn't able to get past this error.
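
One detail that may be relevant (a guess, not a confirmed diagnosis): both spawned processes print "assigning the rank and nworkers 1 0", i.e. each apparently sees an MPI world of size 1 and rank 0, so each takes the master code path and then trips the packet-count assertion. Assuming the script uses mpi4py for its master/worker split, a hypothetical sanity check is to confirm that processes launched by mpirun actually share one MPI world:

# check_mpi.py -- hypothetical sanity check, not part of the repo.
# Run with:  mpirun -np 2 python check_mpi.py
# If mpi4py is built against the same MPI that mpirun comes from, the two
# processes should print "rank 0 of 2" and "rank 1 of 2". If both print
# "rank 0 of 1", each process is running in its own MPI world, and every
# process will behave as the master.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank %d of %d" % (comm.Get_rank(), comm.Get_size()))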

rnn_r_loss: -0.9799

Hi @davidADSP,
When I train the RNN model, it's strange that rnn_r_loss decreases to a negative value. I checked how rnn_r_loss is defined:
result = -K.log(result + 1e-8)
It seems this should never be negative, but it happens.
Why does this happen, and how can I solve it? Thank you.
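
One possible explanation (a guess, not a confirmed answer): if result is the mixture density evaluated at the target, as in the usual MDN loss, then it is a probability density rather than a probability, and densities can exceed 1, so -K.log(result + 1e-8) can legitimately go negative. A tiny numpy sketch:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

density = gaussian_pdf(0.0, mu=0.0, sigma=0.1)   # ~3.99, a narrow Gaussian
print(density, -np.log(density + 1e-8))          # density > 1, so the "loss" < 0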

RNN validation loss NaN and crashes EarlyStopping, but normal loss is fine

If I train the RNN with the default configuration, the validation losses computed at the end of each epoch are NaN. This is strange, because the regular training loss computed every batch seems fine.
Since the EarlyStopping callback monitors the validation loss, the entire training script crashes after 5 epochs (the patience is 5, so the callback starts comparing NaN values after the 5th epoch).

Any ideas what might cause this?
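
A hypothetical first check (assuming rnn_input / rnn_output are the numpy arrays passed to rnn.fit with validation_split=0.2): Keras takes the validation split from the end of the arrays, so it is worth inspecting that tail slice for NaNs or infs before suspecting the loss itself.

import numpy as np

def check_validation_tail(name, arr, split=0.2):
    # Keras' validation_split uses the *last* fraction of the data,
    # so inspect that slice for NaNs / infs.
    n_val = int(len(arr) * split)
    tail = arr[-n_val:]
    print(name, "NaNs:", int(np.isnan(tail).sum()),
          "infs:", int(np.isinf(tail).sum()))

# usage (array names assumed from the training script):
# check_validation_tail("rnn_input", rnn_input)
# check_validation_tail("rnn_output", rnn_output)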

Unable to train controller on colab

!xvfb-run -a -s "-screen 0 1400x900x24" python 05_train_controller.py car_racing --num_worker 16 --num_worker_trial 2 --num_episode 4 --max_length 1000 --eval_steps 25

/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
['mpirun', '-np', '17', '/usr/bin/python3', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']

mpirun has detected an attempt to run as root.
Running at root is strongly discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

You can override this protection by adding the --allow-run-as-root
option to your cmd line. However, we reiterate our strong advice
against doing so - please do so at your own risk.

Traceback (most recent call last):
File "05_train_controller.py", line 525, in
if "parent" == mpi_fork(args.num_worker+1): os.exit()
File "05_train_controller.py", line 492, in mpi_fork
subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '17', '/usr/bin/python3', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1.
[36c641d9ccde:04530] *** Process received signal ***
[36c641d9ccde:04530] Signal: Segmentation fault (11)
[36c641d9ccde:04530] Signal code: Address not mapped (1)
[36c641d9ccde:04530] Failing at address: 0x7f395ad0320d
[36c641d9ccde:04530] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f395ddb2890]
[36c641d9ccde:04530] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f395d9f1785]
[36c641d9ccde:04530] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f395e25ce44]
[36c641d9ccde:04530] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f395d9f2615]
[36c641d9ccde:04530] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f395e25acb3]
[36c641d9ccde:04530] *** End of error message ***
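
One possible workaround, taken from mpirun's own message (a hypothetical edit, use at your own risk as the warning says): Colab runs as root, so the mpirun call inside mpi_fork() at 05_train_controller.py line 492 could be given --allow-run-as-root, e.g.

subprocess.check_call(
    ["mpirun", "--allow-run-as-root", "-np", str(n), sys.executable] + ['-u'] + sys.argv,
    env=env)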

Error training VAE - input wrong dimension

I did generate only 2 episodes to do a dry run before committing to all 2000. Perhaps that's the problem.

ls -lh data
total 5.5G
-rw-rw-r-- 1 tkramer tkramer 1.4M Apr 23 08:41 action_data_car_racing_0.npy
-rw-rw-r-- 1 tkramer tkramer 1.4M Apr 23 08:57 action_data_car_racing_1.npy
-rw-rw-r-- 1 tkramer tkramer 2.8G Apr 23 08:41 obs_data_car_racing_0.npy
-rw-rw-r-- 1 tkramer tkramer 2.8G Apr 23 08:57 obs_data_car_racing_1.npy

trained with:
python 02_train_vae.py --start_batch 0 --max_batch 1 --new_model

At the end of the first epoch:

48000/48000 [==============================] - 323s 7ms/step - loss: 7.0854 - vae_r_loss: 6.5440 - vae_kl_loss: 0.5414 - val_loss: 3.3674 - val_vae_r_loss: 2.8730 - val_vae_kl_loss: 0.4944
Building batch 1...
Traceback (most recent call last):
  File "02_train_vae.py", line 53, in <module>
    main(args)
  File "02_train_vae.py", line 42, in main
    vae.train(data)
  File "/home/tkramer/projects/WorldModels/vae/arch.py", line 120, in train
    callbacks=callbacks_list)
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/keras/engine/training.py", line 1630, in fit
    batch_size=batch_size)
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/keras/engine/training.py", line 1476, in _standardize_user_data
    exception_prefix='input')
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
    'with shape ' + str(data_shape))
ValueError: Error when checking input: expected input_1 to have 4 dimensions, but got array with shape (3840000, 64, 3)
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f1e1df6f278>>
Traceback (most recent call last):
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 712, in __del__
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/tensorflow/python/framework/c_api_util.py", line 31, in __init__
TypeError: 'NoneType' object is not callable
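
A hypothetical sanity check (assuming the VAE expects frames of shape (64, 64, 3), as the error message implies): inspect what a generated batch actually contains, and restore the frame axes if they were merged.

import numpy as np

obs = np.load('data/obs_data_car_racing_0.npy')
print(obs.shape, obs.dtype)   # expected something like (N, 64, 64, 3)

if obs.ndim == 3 and obs.shape[1:] == (64, 3):
    # frames appear flattened into an (N*64, 64, 3) block; restore them
    obs = obs.reshape(-1, 64, 64, 3)
    print('reshaped to', obs.shape)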

VAE performance is bad (with standard parameters)

Running the code as is I get the following results after training the VAE (python 02_train_vae.py --start_batch 0 --max_batch 9 --new_model):


loss: 0.1416 - vae_r_loss: 0.1416 - vae_kl_loss: 3.8347e-07 - val_loss: 0.1378 - val_vae_r_loss: 0.1378 - val_vae_kl_loss: 2.3842e-07

Has anyone changed anything, or what are your experiences?

NaNs in KL Loss

I'm getting NaNs in the KL loss (specifically the K.square(vae_z_mean) part) in the first epoch of the first batch when trying to train the VAE. First the KL loss (and the overall loss) blows up, then it starts coming back down, but it turns into NaNs before long:

13696/48000 [=======>......................] - ETA: 4:32 - loss: 289661273.9050 - vae_r_loss: 0.1846 - vae_kl_loss: 289661273.7244
23904/48000 [=============>................] - ETA: 3:13 - loss: 191863659.1293 - vae_r_loss: 0.1638 - vae_kl_loss: 191863658.9721
24992/48000 [==============>...............] - ETA: 3:04 - loss: nan - vae_r_loss: nan - vae_kl_loss: nan

Mac OS X 10.13
Python 3.6.2
tf.__version__ '1.7.0'
keras.__version__ '2.1.5'
gym.__version__ '0.10.5'
np.__version__ '1.14.2'

I've tried over from the beginning a couple of times, including a clean virtualenv. I've looked at some of the stats from the data files, and don't see anything too suspicious there.

Any suggestions would be welcome.

Changes I've tried:

  • Using SGD instead of RMSProp
  • Normalizing the images (subtracting the mean, dividing by the std)
  • Increasing and decreasing the batch size
  • Changing the constant in vae_r_loss() [there was some improvement when it was 1.0; more testing needed]
  • Training on a friend's generated data, which he used to successfully train a VAE (on a GPU)

I haven't tried running on a GPU since I don't have a supported one.

I have successfully run steps 1-4 on my work computer (which I shouldn't be doing), so no NaNs with:

Mac OS X 10.9.5
CPU
Python 3.6.2
tf.__version__ '1.3.0'
keras.__version__ '2.1.3'
gym.__version__ '0.10.4'
np.__version__ '1.14.2'

I'm currently trying to run step 5 on my home machine using saved weights from work. Not a real solution, though.
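
One more knob not in the list of changes above (a sketch, not a verified fix): all Keras optimizers accept clipvalue / clipnorm, and clipping gradients can keep an exploding KL term from driving the weights to NaN.

from keras.optimizers import RMSprop

# clip each gradient element to [-1, 1]; the learning rate and the rest of
# the compile call stay whatever the VAE already uses
optimizer = RMSprop(clipvalue=1.0)
# vae_model.compile(optimizer=optimizer, loss=..., metrics=[...])  # name assumed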

Having issue in reproducing the results

Hi,

Is anyone able to reproduce the results claimed by @davidADSP?
I followed the instructions and steps given in this documentation.
Apparently, the Python commands stated in the documentation are out of date, so I used the commands written at the top of each Python script to run the data generation and training.
However, the trained controller is unable to drive properly. I constantly get bad results (e.g. scores of roughly -50 to 10 at the end of each round).

Confusion on implementation of 'get_mixture_coef()'

pi = K.reshape(pi, [-1, rollout_length, GAUSSIAN_MIXTURES, Z_DIM])

This line is from the function get_mixture_coef(). I wonder whether pi should have shape [-1, rollout_length, GAUSSIAN_MIXTURES, 1] to match the usual definition of a GMM in an MDN. That way, SketchRNN's implementation would be a special case of this one with Z_DIM=2.

Am I right? I can't reach a final conclusion by myself and would like to discuss it with you.
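
For comparison, a small numpy sketch of the two shape conventions being discussed (my reading, not an authoritative answer): this implementation keeps a separate set of mixture weights per z dimension, i.e. Z_DIM independent 1-D mixtures, whereas a SketchRNN-style joint GMM would share one set of weights across the output dimensions.

import numpy as np

ROLLOUT, MIXTURES, Z_DIM, BATCH = 5, 3, 32, 2

def softmax_over_mixtures(logits):
    e = np.exp(logits)
    return e / e.sum(axis=2, keepdims=True)   # normalise over the mixture axis

# convention in this implementation: one mixture distribution per z dimension
pi_per_dim = softmax_over_mixtures(np.random.rand(BATCH, ROLLOUT, MIXTURES, Z_DIM))

# SketchRNN-style convention: one mixture distribution shared by all dimensions
pi_shared = softmax_over_mixtures(np.random.rand(BATCH, ROLLOUT, MIXTURES, 1))

print(pi_per_dim.shape, pi_shared.shape)   # (2, 5, 3, 32) vs (2, 5, 3, 1)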

Error training controller with rank

Thanks for sharing this great repo.
I am receiving the following error when I run the controller module.
In fact, the previous suggestion from @davidADSP (in one of the closed issues) did not help me.
Any other suggestions for the following error?

Traceback (most recent call last):
  File "05_train_controller.py", line 521, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "05_train_controller.py", line 491, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '2', '/home/baheri/.virtualenvs/worldmodels/bin/python', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '1', '--num_worker_trial', '1', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1.

Thanks.
