
Comments (12)

mks0601 commented on August 10, 2024

Hi!

Hmm.. this is very weird, because I haven't met any NaN issue during training in lots of experiments...
Could you check that you are loading the correct meshes in Human36M/Human36M.py and MSCOCO/MSCOCO.py? Maybe you can use the vis_mesh and save_obj functions in utils/vis.py.
Also, could you train again without loss_mesh_normal and loss_mesh_edge?
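For the mesh check, something along these lines (just a sketch; please verify the actual signatures of save_obj/vis_mesh in common/utils/vis.py) dumps one loaded mesh to an .obj file you can open in MeshLab or Blender:

from utils.vis import save_obj

# Rough sketch only: drop it into Human36M.__getitem__() right after get_smpl_coord()
# returns. save_obj(vertices, faces, file_name) is assumed to match common/utils/vis.py;
# the face array may be stored as self.face or self.smpl.face depending on the class.
save_obj(smpl_mesh_cam, self.face, 'debug_smpl_mesh.obj')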

I haven't tried batch size 48 (your GPU must be quite large if it can handle 48), but I have tried 8 and 16, with 2~4 GPUs.

windness97 commented on August 10, 2024

@mks0601 Hi, thank you for your prompt reply!
I've just disabled mesh_fit, mesh_normal, and mesh_edge, since they all become NaN at some point in my training process, and I'm trying to visualize the mesh models.
It might take a while before I have further progress.
Thanks again!

mks0601 commented on August 10, 2024

I think loss_mesh_fit is necessary for the mesh reconstruction. Please let me know of any progress!

windness97 commented on August 10, 2024

@mks0601 Hi!
I think I've found out where the problem is.

I've already run demo/demo.py successfully (using the pre-trained snapshot you provided: snapshot_8.pth.tar), and the output rendered_mesh_lixel.jpg looks fine, so it's probably not a problem with the SMPL model.

I tried disabling loss['mesh_fit'], loss['mesh_normal'], and loss['mesh_edge'], and used the resulting snapshot with demo.py (visualizing with the vis_keypoints() function from common/utils/vis.py). The mesh is a mess, but the keypoints look fine, so the keypoint regression has no problem.

Then I debugged the training process to figure out what makes the mesh losses NaN (loss['mesh_fit'], loss['mesh_normal'], loss['mesh_edge']), and it turns out to be targets['fit_mesh_img'] (in main/model.py: forward()). targets['fit_mesh_img'] randomly contains NaN values (usually only one vertex coordinate becomes NaN) at some point during training; it happens only a few times per epoch.
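A small guard like the one below (just a sketch, not code from the repo) is enough to flag the bad batch as soon as it appears:

import torch

# Minimal NaN guard (sketch): call it on the supervision targets inside
# main/model.py forward() or in the training loop of main/train.py.
def report_nan(name, tensor):
    mask = torch.isnan(tensor)
    if mask.any():
        idx = torch.nonzero(mask)                     # indices of every NaN element
        print('NaN in %s: %d elements, first at %s' % (name, idx.shape[0], idx[0].tolist()))

# e.g. report_nan("targets['fit_mesh_img']", targets['fit_mesh_img'])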

This error happens randomly with a small probability, so I wondered whether it is related to specific images; I recorded some images from Human3.6M that trigger the error:

../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg

and I wrote a script to reproduce the error:
debug_h36m_nan.txt

The above script:

  1. Defines a dataset class SubHuman36M that extends data.Human36M.Human36M.Human36M. It only returns the designated samples (the 2 images above), so I slightly override __init__() and load_data(). I also override __getitem__(), setting the exclude_flip parameter of augmentation() to True (because that way the NaN error always occurs) and modifying the return value of __getitem__() to include the image path for logging. No other modifications beyond that.
  2. Uses SubHuman36M to create a dataloader that performs the same operations as Human36M, but only on the designated images, then repeatedly fetches the processed data and checks whether targets['fit_mesh_img'] contains NaN values (see the sketch after the run command below).

Just put it in the main directory and run:

python debug_h36m_nan.py
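The core of the script is essentially this loop (a simplified sketch; the attached debug_h36m_nan.txt is the full version):

import numpy as np
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# SubHuman36M is the small wrapper class described above; constructor args mirror Human36M.
dataset = SubHuman36M(transforms.ToTensor(), 'train')
loader = DataLoader(dataset, batch_size=2, shuffle=False, num_workers=0)

for test_no in range(5):
    print('----- test no.%d -----' % test_no)
    # __getitem__() was modified to also return the image path for logging
    for inputs, targets, meta_info, img_path in loader:
        fit_mesh_img = targets['fit_mesh_img'].numpy()    # (batch, vertex_num, 3)
        bad = np.isnan(fit_mesh_img).reshape(len(fit_mesh_img), -1).any(axis=1)
        if bad.any():
            print('nan occurs:', [p for p, b in zip(img_path, bad) if b])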

In my environment, the NaN error ALWAYS happens on the 2 designated images (note that I've modified __getitem__() and forced the augmentation not to flip):

debug_h36m_nan.py
creating index...
  0%|          | 0/1559752 [00:00<?, ?it/s]index created!
Get bounding box and root from groundtruth
100%|██████████| 1559752/1559752 [00:02<00:00, 659397.26it/s]
only test 2 imgs:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg

start test:


----- test no.0 -----
/home/windness/windness/proj/HPE/I2L-MeshNet_RELEASE/main/../common/utils/transforms.py:12: RuntimeWarning: divide by zero encountered in true_divide
  x = cam_coord[:,0] / cam_coord[:,2] * f[0] + c[0]
/home/windness/windness/proj/HPE/I2L-MeshNet_RELEASE/main/../common/utils/transforms.py:13: RuntimeWarning: divide by zero encountered in true_divide
  y = cam_coord[:,1] / cam_coord[:,2] * f[1] + c[1]
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.1 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.2 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.3 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.4 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']

Process finished with exit code 0

You can see that before the error there is a divide-by-zero warning in common/utils/transforms.py. Following this clue, I found that the NaN values come from the get_smpl_coord() function in Human36M.py, which returns smpl_mesh_coord containing a 0 value on the z-axis; the divide-by-zero is then triggered in common/utils/transforms.py. I had no better idea how to deal with this error, so I simply added a small float value to the denominator:

def cam2pixel(cam_coord, f, c):
    # workaround: if this is a mesh (> 6000 points, i.e. SMPL vertices + joints) and any
    # z coordinate is exactly 0, add a small epsilon to the denominator to avoid divide-by-zero
    if cam_coord.shape[0] > 6000 and len(np.where(cam_coord[:, 2] == 0)[0]) > 0:
        x = cam_coord[:,0] / (cam_coord[:,2] + 0.001) * f[0] + c[0]
        y = cam_coord[:,1] / (cam_coord[:,2] + 0.001) * f[1] + c[1]
        z = cam_coord[:,2]
    else:
        x = cam_coord[:,0] / cam_coord[:,2] * f[0] + c[0]
        y = cam_coord[:,1] / cam_coord[:,2] * f[1] + c[1]
        z = cam_coord[:,2]
    return np.stack((x,y,z),1)

and the error seems to be solved.

I don't know whether this will cause any accuracy loss or other problems, nor why get_smpl_coord() returns smpl_mesh_coord with a 0 value in the first place. (Maybe it's a bug that only occurs with specific environment settings?) I also don't know whether there are other causes of the NaN error besides the divide-by-zero.

I've just started training with that simple modification, to see whether there are other problems.
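If adding 0.001 to every denominator feels too blunt, a more targeted variant of the same idea (again just a sketch) would only touch the z values that are exactly zero:

import numpy as np

def cam2pixel(cam_coord, f, c):
    # only the degenerate z values get the epsilon; every other vertex keeps its exact projection
    z = cam_coord[:, 2].copy()
    z[z == 0] = 1e-3
    x = cam_coord[:, 0] / z * f[0] + c[0]
    y = cam_coord[:, 1] / z * f[1] + c[1]
    return np.stack((x, y, cam_coord[:, 2]), 1)       # keep the original z in the output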

Any suggestions?

mks0601 commented on August 10, 2024

Hi

I got this result.

creating index...
index created!
Get bounding box and root from groundtruth
100%|██████████████████████████████████████████████████| 1559752/1559752 [00:06<00:00, 247654.87it/s]
only test 2 imgs:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg

start test:


----- test no.0 -----
----- test no.1 -----
----- test no.2 -----
----- test no.3 -----
----- test no.4 -----
----- test no.5 -----
----- test no.6 -----
----- test no.7 -----
----- test no.8 -----
----- test no.9 -----

Basically, I didn't get any NaN error. Could you check which cam2pixel call gives the error, and whether some coordinates contain a zero element?
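One quick way to locate it (a sketch) is to make NumPy raise on that division instead of only warning, so the traceback points at the exact call site:

import numpy as np

# turn 'divide by zero encountered in true_divide' into a FloatingPointError
np.seterr(divide='raise')

# or check the input explicitly right before a suspect cam2pixel() call:
# zero_idx = np.where(smpl_coord_cam[:, 2] == 0)[0]
# print('vertices with z == 0:', zero_idx)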

windness97 commented on August 10, 2024

@mks0601 Hi!
Sure.

I've debugged the script on s_06_act_14_subact_02_ca_04_000856.jpg only, and the error goes like this:
in main/debug_h36m_nan.py, SubHuman36M.__getitem__(), lines 201-206:
(I've modified this file, so the line numbers may not match the original)

                # smpl coordinates
                smpl_mesh_cam, smpl_joint_cam, smpl_pose, smpl_shape = self.get_smpl_coord(smpl_param, cam_param, do_flip, img_shape)
                smpl_coord_cam = np.concatenate((smpl_mesh_cam, smpl_joint_cam))
                focal, princpt = cam_param['focal'], cam_param['princpt']
                smpl_coord_img = cam2pixel(smpl_coord_cam, focal, princpt)

On line 202, the returned smpl_mesh_cam contains a 0 value, so smpl_coord_cam contains a 0 value as well.
On line 206, smpl_coord_cam is passed into cam2pixel() as cam_coord; it contains a 0 value on the z-axis, so a divide-by-zero occurs and smpl_coord_img ends up with -inf values:
[Screenshot from 2020-08-27 19-20-55]

You can see that vertex no. 4794 is the only vertex with a 0 value on the z-axis (in both smpl_mesh_cam and smpl_coord_cam); after line 206, smpl_coord_img has -inf values on the x-axis and y-axis of vertex no. 4794.

Then, in main/debug_h36m_nan.py, SubHuman36M.__getitem__(), lines 208-215:

                # affine transform x,y coordinates, root-relative depth
                smpl_coord_img_xy1 = np.concatenate((smpl_coord_img[:, :2], np.ones_like(smpl_coord_img[:, :1])), 1)
                smpl_coord_img[:, :2] = np.dot(img2bb_trans, smpl_coord_img_xy1.transpose(1, 0)).transpose(1, 0)[:, :2]
                smpl_coord_img[:, 2] = smpl_coord_img[:, 2] - smpl_coord_cam[self.vertex_num + self.root_joint_idx][2]
                # coordinates voxelize
                smpl_coord_img[:, 0] = smpl_coord_img[:, 0] / cfg.input_img_shape[1] * cfg.output_hm_shape[2]
                smpl_coord_img[:, 1] = smpl_coord_img[:, 1] / cfg.input_img_shape[0] * cfg.output_hm_shape[1]
                smpl_coord_img[:, 2] = (smpl_coord_img[:, 2] / (cfg.bbox_3d_size * 1000 / 2) + 1) / 2. * \
                                       cfg.output_hm_shape[0]  # change cfg.bbox_3d_size from meter to milimeter

After line 209, smpl_coord_img_xy1 contains -inf values too.
After line 210, smpl_coord_img contains -inf on the x-axis and NaN on the y-axis:
[Screenshot from 2020-08-27 19-25-20]
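The reason x stays -inf while y becomes NaN is the affine dot product: under IEEE 754, 0 * inf and inf - inf both give NaN, so whichever row of img2bb_trans multiplies the -inf entry by 0 produces NaN. A tiny demo with a made-up matrix:

import numpy as np

img2bb_trans = np.array([[0.5, 0.3, 10.0],
                         [0.0, 0.5, 20.0]])            # made-up 2x3 affine, only for illustration
xy1 = np.array([[-np.inf, -np.inf, 1.0]])              # the degenerate vertex after cam2pixel()
print(np.dot(img2bb_trans, xy1.T).T)                   # [[-inf  nan]] -> -inf on x, NaN on y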

Then, after line 226, smpl_mesh_img contains -inf and NaN values, which become the final output (targets['fit_mesh_img']):

                # split mesh and joint coordinates
                smpl_mesh_img = smpl_coord_img[:self.vertex_num];

[Screenshot from 2020-08-27 19-37-25]

If I'm the only one who has this problem, then it might be something to do with my environment.
I'm using Python 3.7.7, torch==1.4.0, numpy==1.19.1.
Here's the detailed package list, as returned by pip list:

Package         Version
--------------- -------------------
certifi         2020.6.20
chumpy          0.69
cycler          0.10.0
Cython          0.29.21
decorator       4.4.2
freetype-py     2.2.0
future          0.18.2
imageio         2.9.0
kiwisolver      1.2.0
matplotlib      3.3.1
networkx        2.4
numpy           1.19.1
opencv-python   4.4.0.42
Pillow          7.2.0
pip             20.2.2
pycocotools     2.0.1
pyglet          1.5.7
PyOpenGL        3.1.0
pyparsing       2.4.7
pyrender        0.1.43
python-dateutil 2.8.1
scipy           1.5.2
setuptools      49.6.0.post20200814
six             1.15.0
torch           1.4.0
torchgeometry   0.1.2
torchvision     0.5.0
tqdm            4.48.2
transforms3d    0.3.1
trimesh         3.8.1
wheel           0.34.2

Here's the good news: I modified cam2pixel() to avoid the divide-by-zero problem, and the training process has been fine so far.
Here are the keypoint and mesh results after 2 epochs:
[image: output_joint_lixel]
[image: rendered_mesh_lixel]

Could you please share your Python environment details? That would help. ^ ^

mks0601 commented on August 10, 2024

Did you modify the get_smpl_coord function of Human36M.py? For example, making the coordinates root-relative. Could you check yours against mine line by line? cam means camera-centered coordinates, and a 0 z-axis coordinate means zero distance from the camera along the z-axis, which is nonsense. Could you visualize smpl_coord_img on the image in Human36M.py using the vis_mesh function?
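Something like this inside __getitem__() should be enough (just a sketch; please check the actual vis_mesh signature in common/utils/vis.py):

import cv2
from utils.vis import vis_mesh

# Rough sketch: smpl_coord_img right after cam2pixel() is in original-image pixel
# coordinates, so overlay it on the original (uncropped) image of the current sample.
# vis_mesh is assumed to take (image, Nx2 pixel coordinates).
orig_img = cv2.imread(img_path)
vis = vis_mesh(orig_img, smpl_coord_img[:self.vertex_num, :2])
cv2.imwrite('smpl_coord_img_check.jpg', vis)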

windness97 commented on August 10, 2024

Hmmmm...
I'm sure I haven't changed anything in get_smpl_coord() of Human36M.py.
I failed to visualize smpl_coord_img on the input image; I still don't fully understand these coordinate transforms. I'll let you know if I make further progress. ^ ^

Besides, do you think it could be due to using the 3DMPPE version of the Human3.6M dataset?

mks0601 commented on August 10, 2024

The data from 3DMPPE is exactly the same as that of I2L-MeshNet; I just added the SMPL parameters. Ah, when did you download the H36M data? I changed the extrinsic camera parameters and the corresponding functions on Jun 8 this year. I think this can make the coordinates zero, because the translation vector was changed. If you downloaded the data before Jun 8, could you download the camera parameters again and check for the error?

windness97 commented on August 10, 2024

@mks0601 Hi!
Wow, that makes sense! I downloaded the H36M data at least half a year ago.
I'll re-download the annotations and check whether that solves the problem. ^ ^

mks0601 commented on August 10, 2024

Awesome!

windness97 commented on August 10, 2024

Problem solved!
Thank you! ^ ^
