worldmodels's Issues

About the new version

Hi, thanks for your work. I noticed that some files were updated, such as 03_generate_rnn_data.py and 04_train_rnn.py. Is this because the previous versions had an error? (I had already used the non-updated files to train my agent.) Reading the updated code, it looks almost the same except that the RNN can now also predict the reward. Is that right?

how does Keras handle loss in a batch?

Hi, thanks for the implementation!

I have a question related to how Keras handles the loss in a batch. I noticed that the loss in the RNN is defined as:

result = K.mean(result, axis = (1,2)) # mean over rollout length and z dim

That means that you end up with one loss value per sample in the batch. I am not very familiar with Keras, so I wonder what happens to these losses when you call

rnn.fit(rnn_input, rnn_output,
        shuffle=True,
        epochs=mdn_epochs,
        batch_size=batch_size,
        validation_split=0.2,
        callbacks=callbacks_list)

Does Keras average over the batch? Could you point me to some documentation about it?

Thanks a lot!
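
For what it's worth, here is a minimal sketch (not from this repo, just a self-contained check of standard Keras behaviour): with no sample weights, Keras reduces the per-sample values returned by a custom loss to their mean over the batch, so the scalar it reports is that mean.

import numpy as np
from keras import backend as K
from keras.layers import Input, Lambda
from keras.models import Model

def per_sample_loss(y_true, y_pred):
    # returns one loss value per sample, like the RNN loss above
    return K.mean(K.square(y_true - y_pred), axis=-1)

inp = Input(shape=(4,))
out = Lambda(lambda x: x)(inp)   # identity model, prediction == input
model = Model(inp, out)
model.compile(optimizer='sgd', loss=per_sample_loss)

x = np.random.rand(8, 4).astype('float32')
y = np.zeros((8, 4), dtype='float32')

reported = model.test_on_batch(x, y)               # scalar loss Keras reports
manual = np.mean(np.mean((x - y) ** 2, axis=-1))   # mean of the per-sample losses
print(reported, manual)                            # should agree up to float error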

VAE reconstruction loss

Wow! Cool to see a reproduction so fast! Thanks a lot!

Shouldn't the reconstruction loss:

def vae_r_loss(y_true, y_pred):
    return K.sum(K.square(y_true - y_pred), axis = [1,2,3])
be based on the mean rather than the sum?

If it's sum-based, then its value depends on the size of the image. It also throws off the balance between the KL and reconstruction losses: the larger the image, the more the reconstruction loss overwhelms the optimisation.
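
For illustration, a mean-based variant could look like the sketch below (my rewrite, not the repo's code); the constant used to weight it against the KL term would then need re-tuning.

from keras import backend as K

def vae_r_loss_mean(y_true, y_pred):
    # mean over height, width and channels, so the value no longer
    # scales with the image size
    return K.mean(K.square(y_true - y_pred), axis=[1, 2, 3])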

Error training controller

/home/boucher/.virtualenvs/worldmodels/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
['mpirun', '-np', '2', '/home/boucher/.virtualenvs/worldmodels/bin/python', '05_train_controller.py', 'car_racing', '--num_worker', '1', '--num_worker_trial', '2', '--num_episode', '16', '--max_length', '1000', '--eval_steps', '25']
/home/boucher/.virtualenvs/worldmodels/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/home/boucher/.virtualenvs/worldmodels/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Using TensorFlow backend.
assigning the rank and nworkers 1 0
assigning the rank and nworkers 1 0
2018-04-18 18:30:08.922363: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-18 18:30:08.927218: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-18 18:30:09.084787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-18 18:30:09.085010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 9.47GiB
2018-04-18 18:30:09.085020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-18 18:30:09.085433: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-18 18:30:09.085774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 9.47GiB
2018-04-18 18:30:09.085783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-18 18:30:09.224535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-18 18:30:09.224562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-04-18 18:30:09.224566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-04-18 18:30:09.224701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9110 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-04-18 18:30:09.228563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-18 18:30:09.228578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-04-18 18:30:09.228597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-04-18 18:30:09.229018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 253 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
size of model 867
size of model 867
(1,2mirr1)-aCMA-ES (mu_w=1.0,w_1=100%) in dimension 867 (seed=205998, Wed Apr 18 18:30:09 2018)
('process', 0, 'out of total ', 1, 'started')
('training', 'car_racing')
('population', 2)
('num_worker', 1)
('num_worker_trial', 2)
('num_episode', 16)
('max_length', 1000)
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError
(1,2mirr1)-aCMA-ES (mu_w=1.0,w_1=100%) in dimension 867 (seed=250130, Wed Apr 18 18:30:09 2018)
('process', 0, 'out of total ', 1, 'started')
('training', 'car_racing')
('population', 2)
('num_worker', 1)
('num_worker_trial', 2)
('num_episode', 16)
('max_length', 1000)
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError
Traceback (most recent call last):
  File "05_train_controller.py", line 460, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "05_train_controller.py", line 429, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '2', '/home/boucher/.virtualenvs/worldmodels/bin/python', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '1', '--num_worker_trial', '2', '--num_episode', '16', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1

I'm not exactly sure what the problem is here. I had to lower the number of workers to run on my GPU without running out of memory, but playing around with that number and num_worker_trial, I wasn't able to get past this error.
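
One detail that may be relevant (a guess, not a confirmed diagnosis): both spawned processes print "assigning the rank and nworkers 1 0", i.e. each apparently sees an MPI world of size 1 and rank 0, so each takes the master code path and then trips the packet-count assertion. Assuming the script uses mpi4py for its master/worker split, a hypothetical sanity check is to confirm that processes launched by mpirun actually share one MPI world:

# check_mpi.py -- hypothetical sanity check, not part of the repo.
# Run with:  mpirun -np 2 python check_mpi.py
# If mpi4py is built against the same MPI that mpirun comes from, the two
# processes should print "rank 0 of 2" and "rank 1 of 2". If both print
# "rank 0 of 1", each process is running in its own MPI world, and every
# process will behave as the master.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank %d of %d" % (comm.Get_rank(), comm.Get_size()))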

rnn_r_loss: -0.9799

Hi @davidADSP,
When I train the RNN model, it's strange that rnn_r_loss decreases to a negative value. I checked how rnn_r_loss is defined:
result = -K.log(result + 1e-8)
It seems this should never be negative, but it happens.
Why does this happen, and how can I solve it? Thank you.
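
One possible explanation (a guess, not a confirmed answer): if result is the mixture density evaluated at the target, as in the usual MDN loss, then it is a probability density rather than a probability, and densities can exceed 1, so -K.log(result + 1e-8) can legitimately go negative. A tiny numpy sketch:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

density = gaussian_pdf(0.0, mu=0.0, sigma=0.1)   # ~3.99, a narrow Gaussian
print(density, -np.log(density + 1e-8))          # density > 1, so the "loss" < 0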

RNN validation loss NaN and crashes EarlyStopping, but normal loss is fine

If I train the RNN with the default configuration, the validation losses computed at the end of each epoch are NaN. This is strange, because the regular training loss computed every batch seems fine.
Since the EarlyStopping callback monitors the validation loss, the entire training script crashes after 5 epochs (the patience is 5, so the callback starts comparing NaN values after the 5th epoch).

Any ideas what might cause this?
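
A hypothetical first check (assuming rnn_input / rnn_output are the numpy arrays passed to rnn.fit with validation_split=0.2): Keras takes the validation split from the end of the arrays, so it is worth inspecting that tail slice for NaNs or infs before suspecting the loss itself.

import numpy as np

def check_validation_tail(name, arr, split=0.2):
    # Keras' validation_split uses the *last* fraction of the data,
    # so inspect that slice for NaNs / infs.
    n_val = int(len(arr) * split)
    tail = arr[-n_val:]
    print(name, "NaNs:", int(np.isnan(tail).sum()),
          "infs:", int(np.isinf(tail).sum()))

# usage (array names assumed from the training script):
# check_validation_tail("rnn_input", rnn_input)
# check_validation_tail("rnn_output", rnn_output)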

Unable to train controller on colab

!xvfb-run -a -s "-screen 0 1400x900x24" python 05_train_controller.py car_racing --num_worker 16 --num_worker_trial 2 --num_episode 4 --max_length 1000 --eval_steps 25

/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
['mpirun', '-np', '17', '/usr/bin/python3', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']

mpirun has detected an attempt to run as root.
Running at root is strongly discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

You can override this protection by adding the --allow-run-as-root
option to your cmd line. However, we reiterate our strong advice
against doing so - please do so at your own risk.

Traceback (most recent call last):
File "05_train_controller.py", line 525, in
if "parent" == mpi_fork(args.num_worker+1): os.exit()
File "05_train_controller.py", line 492, in mpi_fork
subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '17', '/usr/bin/python3', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '16', '--num_worker_trial', '2', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1.
[36c641d9ccde:04530] *** Process received signal ***
[36c641d9ccde:04530] Signal: Segmentation fault (11)
[36c641d9ccde:04530] Signal code: Address not mapped (1)
[36c641d9ccde:04530] Failing at address: 0x7f395ad0320d
[36c641d9ccde:04530] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f395ddb2890]
[36c641d9ccde:04530] [ 1] /lib/x86_64-linux-gnu/libc.so.6(getenv+0xa5)[0x7f395d9f1785]
[36c641d9ccde:04530] [ 2] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(_ZN13TCMallocGuardD1Ev+0x34)[0x7f395e25ce44]
[36c641d9ccde:04530] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5)[0x7f395d9f2615]
[36c641d9ccde:04530] [ 4] /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4(+0x13cb3)[0x7f395e25acb3]
[36c641d9ccde:04530] *** End of error message ***
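
One possible workaround, taken from mpirun's own message (a hypothetical edit, use at your own risk as the warning says): Colab runs as root, so the mpirun call inside mpi_fork() at 05_train_controller.py line 492 could be given --allow-run-as-root, e.g.

subprocess.check_call(
    ["mpirun", "--allow-run-as-root", "-np", str(n), sys.executable] + ['-u'] + sys.argv,
    env=env)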

Error training VAE - input wrong dimension

I did generate only 2 episodes to do a dry run before committing to all 2000. Perhaps that's the problem.

ls -lh data
total 5.5G
-rw-rw-r-- 1 tkramer tkramer 1.4M Apr 23 08:41 action_data_car_racing_0.npy
-rw-rw-r-- 1 tkramer tkramer 1.4M Apr 23 08:57 action_data_car_racing_1.npy
-rw-rw-r-- 1 tkramer tkramer 2.8G Apr 23 08:41 obs_data_car_racing_0.npy
-rw-rw-r-- 1 tkramer tkramer 2.8G Apr 23 08:57 obs_data_car_racing_1.npy

trained with:
python 02_train_vae.py --start_batch 0 --max_batch 1 --new_model

At the end of the first epoch:

48000/48000 [==============================] - 323s 7ms/step - loss: 7.0854 - vae_r_loss: 6.5440 - vae_kl_loss: 0.5414 - val_loss: 3.3674 - val_vae_r_loss: 2.8730 - val_vae_kl_loss: 0.4944
Building batch 1...
Traceback (most recent call last):
  File "02_train_vae.py", line 53, in <module>
    main(args)
  File "02_train_vae.py", line 42, in main
    vae.train(data)
  File "/home/tkramer/projects/WorldModels/vae/arch.py", line 120, in train
    callbacks=callbacks_list)
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/keras/engine/training.py", line 1630, in fit
    batch_size=batch_size)
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/keras/engine/training.py", line 1476, in _standardize_user_data
    exception_prefix='input')
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
    'with shape ' + str(data_shape))
ValueError: Error when checking input: expected input_1 to have 4 dimensions, but got array with shape (3840000, 64, 3)
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f1e1df6f278>>
Traceback (most recent call last):
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 712, in __del__
  File "/home/tkramer/projects/WorldModels/env/lib/python3.5/site-packages/tensorflow/python/framework/c_api_util.py", line 31, in __init__
TypeError: 'NoneType' object is not callable
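
A hypothetical sanity check (assuming the VAE expects frames of shape (64, 64, 3), as the error message implies): inspect what a generated batch actually contains, and restore the frame axes if they were merged.

import numpy as np

obs = np.load('data/obs_data_car_racing_0.npy')
print(obs.shape, obs.dtype)   # expected something like (N, 64, 64, 3)

if obs.ndim == 3 and obs.shape[1:] == (64, 3):
    # frames appear flattened into an (N*64, 64, 3) block; restore them
    obs = obs.reshape(-1, 64, 64, 3)
    print('reshaped to', obs.shape)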

VAE performance is bad (with standard parameters)

Running the code as is I get the following results after training the VAE (python 02_train_vae.py --start_batch 0 --max_batch 9 --new_model):


loss: 0.1416 - vae_r_loss: 0.1416 - vae_kl_loss: 3.8347e-07 - val_loss: 0.1378 - val_vae_r_loss: 0.1378 - val_vae_kl_loss: 2.3842e-07

Has anyone changed anything, or what are your experiences?

NaNs in KL Loss

I'm getting NaNs in the KL loss (specifically the K.square(vae_z_mean) part) in the first epoch of the first batch when trying to train the VAE. First the KL loss (and the overall loss) blows up, then it starts coming back down, but it turns into NaNs before long:

13696/48000 [=======>......................] - ETA: 4:32 - loss: 289661273.9050 - vae_r_loss: 0.1846 - vae_kl_loss: 289661273.7244
23904/48000 [=============>................] - ETA: 3:13 - loss: 191863659.1293 - vae_r_loss: 0.1638 - vae_kl_loss: 191863658.9721
24992/48000 [==============>...............] - ETA: 3:04 - loss: nan - vae_r_loss: nan - vae_kl_loss: nan

Mac OS X 10.13
Python 3.6.2
tf.__version__ '1.7.0'
keras.__version__ '2.1.5'
gym.__version__ '0.10.5'
np.__version__ '1.14.2'

I've tried over from the beginning a couple of times, including a clean virtualenv. I've looked at some of the stats from the data files, and don't see anything too suspicious there.

Any suggestions would be welcome.

Changes I've tried:

  • Using SGD instead of RMSProp
  • Normalizing the images (subtracting the mean, dividing by the std)
  • Increasing and decreasing the batch size
  • Changing the constant in vae_r_loss() [there was some improvement when it was 1.0; more testing needed]
  • Training on a friend's generated data, which he used to successfully train a VAE (on a GPU)

I haven't tried running on a GPU since I don't have a supported one.

I have successfully run steps 1-4 on my work computer (which I shouldn't be doing), so no NaNs with:

Mac OS X 10.9.5
CPU
Python 3.6.2
tf.__version__ '1.3.0'
keras.__version__ '2.1.3'
gym.__version__ '0.10.4'
np.__version__ '1.14.2'

I'm currently trying to run step 5 on my home machine using saved weights from work. Not a real solution, though.
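
One more knob not in the list of changes above (a sketch, not a verified fix): all Keras optimizers accept clipvalue / clipnorm, and clipping gradients can keep an exploding KL term from driving the weights to NaN.

from keras.optimizers import RMSprop

# clip each gradient element to [-1, 1]; the learning rate and the rest of
# the compile call stay whatever the VAE already uses
optimizer = RMSprop(clipvalue=1.0)
# vae_model.compile(optimizer=optimizer, loss=..., metrics=[...])  # name assumed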

Having issue in reproducing the results

Hi,

Is anyone able to reproduce the results claimed by @davidADSP?
I followed the instructions and steps given in this documentation.
Apparently, the Python commands stated in the documentation are out of date, so I used the commands written at the top of each Python script to run the data generation and training.
However, the trained controller is unable to drive properly. I constantly get bad results (e.g. scores of roughly -50 to 10 at the end of each round).

Confusion on implementation of 'get_mixture_coef()'

pi = K.reshape(pi, [-1, rollout_length, GAUSSIAN_MIXTURES, Z_DIM])

This line is from the function get_mixture_coef(). I wonder whether pi should have shape [-1, rollout_length, GAUSSIAN_MIXTURES, 1] to match the usual definition of a GMM in an MDN. That way, SketchRNN's implementation would be a special case of this one with Z_DIM=2.

Am I right? I can't reach a final conclusion by myself and would like to discuss it with you.
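
For comparison, a small numpy sketch of the two shape conventions being discussed (my reading, not an authoritative answer): this implementation keeps a separate set of mixture weights per z dimension, i.e. Z_DIM independent 1-D mixtures, whereas a SketchRNN-style joint GMM would share one set of weights across the output dimensions.

import numpy as np

ROLLOUT, MIXTURES, Z_DIM, BATCH = 5, 3, 32, 2

def softmax_over_mixtures(logits):
    e = np.exp(logits)
    return e / e.sum(axis=2, keepdims=True)   # normalise over the mixture axis

# convention in this implementation: one mixture distribution per z dimension
pi_per_dim = softmax_over_mixtures(np.random.rand(BATCH, ROLLOUT, MIXTURES, Z_DIM))

# SketchRNN-style convention: one mixture distribution shared by all dimensions
pi_shared = softmax_over_mixtures(np.random.rand(BATCH, ROLLOUT, MIXTURES, 1))

print(pi_per_dim.shape, pi_shared.shape)   # (2, 5, 3, 32) vs (2, 5, 3, 1)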

Error training controller with rank

Thanks for sharing this great repo.
I am receiving the following error when I run the controller module.
In fact, the previous suggestion from @davidADSP (in one of the closed issues) did not help me.
Any other suggestions for the following error?

Traceback (most recent call last):
  File "05_train_controller.py", line 521, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "05_train_controller.py", line 491, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/usr/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '2', '/home/baheri/.virtualenvs/worldmodels/bin/python', '-u', '05_train_controller.py', 'car_racing', '--num_worker', '1', '--num_worker_trial', '1', '--num_episode', '4', '--max_length', '1000', '--eval_steps', '25']' returned non-zero exit status 1.

Thanks.
