Comments (12)
Hi!
Hmm.. This is very weird because I haven't met any NaN issues during training in lots of experiments...
Could you check that you are loading the correct meshes in Human36M/Human36M.py and MSCOCO/MSCOCO.py? Maybe you can use the vis_mesh and save_obj functions in utils/vis.py.
Also, could you train again without loss_mesh_normal and loss_mesh_edge?
I haven't tried batch size 48 (your GPU must be very large if it can handle 48), but I tried 8 and 16, with 2~4 GPUs.
from i2l-meshnet_release.
@mks0601 Hi, thank you for your prompt reply!
I've just disabled mesh_fit, mesh_normal and mesh_edge, since they all become nan at a point in my training process, and I'm trying to visualize the mesh models.
It might take a while before I make further progress.
Thanks again!
from i2l-meshnet_release.
I think loss_mesh_fit is necessary for the mesh reconstruction. Please let me know of any progress!
from i2l-meshnet_release.
@mks0601 Hi!
I think I've found out where the problem is.
I've already successfully executed demo/demo.py (using the pre-trained snapshot you provide: snapshot_8.pth.tar), and the output rendered_mesh_lixel.jpg looks fine, so maybe it's not an SMPL model problem.
I tried disabling loss['mesh_fit'], loss['mesh_normal'], loss['mesh_edge'], and used the resulting snapshot to test in demo.py (using common/utils/vis.py: function vis_keypoints()) to visualize. The mesh is a mess but the keypoints seem fine, so keypoint regression has no problem.
Then I tried to debug the training process to figure out what makes the mesh losses nan (loss['mesh_fit'], loss['mesh_normal'], loss['mesh_edge']), and it turns out to be targets['fit_mesh_img'] (in main/model.py: forward()). targets['fit_mesh_img'] randomly contains some nan values (usually only 1 vertex coordinate becomes nan) at some point in my training process (it happens only several times per epoch).
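A check like the one described can be sketched as follows. This is only an illustration, not the script attached to the thread: the helper name find_nonfinite and the key other_target are hypothetical, and the targets are assumed to be NumPy arrays with one vertex or joint per row.

```python
import numpy as np

def find_nonfinite(targets):
    """Return {name: row indices} for every array holding nan or +/-inf.

    Assumes each value is an array with at least one axis (one row per vertex).
    """
    bad = {}
    for name, value in targets.items():
        arr = np.asarray(value, dtype=np.float64)
        # Flatten everything after the first axis so each row is one vertex/joint.
        rows = arr.reshape(arr.shape[0], -1)
        idx = np.flatnonzero(~np.isfinite(rows).all(axis=1))
        if idx.size:
            bad[name] = idx.tolist()
    return bad

# Toy stand-in for one sample's targets: 5 "vertices", one of them corrupted.
mesh = np.zeros((5, 3))
mesh[3] = [-np.inf, np.nan, 12.0]
report = find_nonfinite({"fit_mesh_img": mesh, "other_target": np.ones((2, 3))})
print(report)  # {'fit_mesh_img': [3]}
```

Logging the image path together with such a report is what makes it possible to tie the nan to specific samples.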
Actually this error happens randomly with a small probability, so I wondered whether it is related to some specific imgs, and I recorded some imgs from Human3.6M that trigger the error:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg
and I wrote a script to reproduce the error:
debug_h36m_nan.txt
the above script:
- defines a dataset class SubHuman36M that extends data.Human36M.Human36M.Human36M. It only returns designated samples (the 2 imgs above), so I slightly override __init__() and load_data(). I also override __getitem__(), setting the parameter exclude_flip of the function augmentation() to True (because this way the nan error always occurs) and modifying the return value of __getitem__() to include the img path for logging. No other modifications besides the above.
- uses SubHuman36M to create a dataloader that performs the same operations as Human36M but only on the designated imgs, and simply iterates over it for processed data, checking whether targets['fit_mesh_img'] contains nan values.
just put it in the main dir and run:
python debug_h36m_nan.py
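For readers without the attachment, the idea of restricting a dataset to a few designated samples can be sketched in plain Python. This is not the attached script, which subclasses Human36M: the class DesignatedSamples and the keys img_path and payload are hypothetical stand-ins.

```python
class DesignatedSamples:
    """Wrap a list of sample dicts and keep only those with a wanted image path."""

    def __init__(self, datalist, wanted_paths):
        wanted = set(wanted_paths)
        self.datalist = [d for d in datalist if d["img_path"] in wanted]

    def __len__(self):
        return len(self.datalist)

    def __getitem__(self, idx):
        data = self.datalist[idx]
        # Return the path alongside the payload so a failing sample can be logged.
        return data["payload"], data["img_path"]

# Hypothetical annotation list standing in for the real dataset index.
full = [
    {"img_path": "a.jpg", "payload": 1},
    {"img_path": "b.jpg", "payload": 2},
    {"img_path": "c.jpg", "payload": 3},
]
subset = DesignatedSamples(full, ["a.jpg", "c.jpg"])
for i in range(len(subset)):
    payload, path = subset[i]
    print(path, payload)
```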
In my environment, the nan error ALWAYS happens on the 2 designated imgs (note that I've modified __getitem__() to force the augmentation not to flip):
debug_h36m_nan.py
creating index...
0%| | 0/1559752 [00:00<?, ?it/s]index created!
Get bounding box and root from groundtruth
100%|██████████| 1559752/1559752 [00:02<00:00, 659397.26it/s]
only test 2 imgs:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg
start test:
----- test no.0 -----
/home/windness/windness/proj/HPE/I2L-MeshNet_RELEASE/main/../common/utils/transforms.py:12: RuntimeWarning: divide by zero encountered in true_divide
x = cam_coord[:,0] / cam_coord[:,2] * f[0] + c[0]
/home/windness/windness/proj/HPE/I2L-MeshNet_RELEASE/main/../common/utils/transforms.py:13: RuntimeWarning: divide by zero encountered in true_divide
y = cam_coord[:,1] / cam_coord[:,2] * f[1] + c[1]
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.1 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.2 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.3 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.4 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
Process finished with exit code 0
You can see that before the error, a divide-by-zero warning is raised in common/utils/transforms.py. Following this clue, I found that the nan value comes from Human36M.py: function get_smpl_coord(), which returns smpl_mesh_coord containing a 0 value on the z-axis, which then triggers the divide-by-zero in common/utils/transforms.py. I have no better idea how to deal with this error, so I simply add a small float value to the denominator:
def cam2pixel(cam_coord, f, c):
    # if False:
    if cam_coord.shape[0] > 6000 and len(np.where(cam_coord[:, 2] == 0)[0]) > 0:
        x = cam_coord[:,0] / (cam_coord[:,2] + 0.001) * f[0] + c[0]
        y = cam_coord[:,1] / (cam_coord[:,2] + 0.001) * f[1] + c[1]
        z = cam_coord[:,2]
    else:
        x = cam_coord[:,0] / cam_coord[:,2] * f[0] + c[0]
        y = cam_coord[:,1] / cam_coord[:,2] * f[1] + c[1]
        z = cam_coord[:,2]
    return np.stack((x,y,z),1)
and the error seems to be solved.
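A slightly different guard, sketched here under assumptions, replaces only the exactly-zero depths instead of shifting every vertex. The function name cam2pixel_safe, the eps default and the toy focal length and principal point are hypothetical, and depths are assumed to be in millimeters.

```python
import numpy as np

def cam2pixel_safe(cam_coord, f, c, eps=1e-3):
    """Perspective projection that substitutes eps only where depth is exactly 0."""
    cam_coord = np.asarray(cam_coord, dtype=np.float64)
    z = cam_coord[:, 2]
    z_div = np.where(z == 0, eps, z)  # valid depths are left untouched
    x = cam_coord[:, 0] / z_div * f[0] + c[0]
    y = cam_coord[:, 1] / z_div * f[1] + c[1]
    return np.stack((x, y, z), 1)

f, c = (1000.0, 1000.0), (500.0, 500.0)  # toy intrinsics, not the dataset's
pts = np.array([[100.0, -50.0, 5000.0],   # ordinary vertex, 5 m from the camera
                [100.0, -50.0, 0.0]])     # degenerate vertex with zero depth
out = cam2pixel_safe(pts, f, c)

# Pixel shift on the ordinary vertex if 0.001 were added to its depth instead.
shift = abs(100.0 / 5000.0 - 100.0 / 5000.001) * f[0]
print(out[0], shift)
```

With these toy numbers the shift from adding 0.001 to a depth of several meters is far below one pixel, so the accuracy cost on ordinary vertices should be negligible. The zero-depth vertex still projects far outside the image, so a guard like this hides the symptom rather than explaining why the depth is zero.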
I don't know if this will cause any accuracy loss or other problems, or why smpl_mesh_coord reaches common/utils/transforms.py with a 0 value. (Maybe it's a bug that only occurs in specific environments?) I also don't know whether there are other causes of the nan error besides divide-by-zero.
I've just started training with that simple modification to see if there are other problems.
Any suggestions?
from i2l-meshnet_release.
Hi
I got this result.
creating index...
index created!
Get bounding box and root from groundtruth
100%|██████████████████████████████████████████████████| 1559752/1559752 [00:06<00:00, 247654.87it/s]
only test 2 imgs:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg
start test:
----- test no.0 -----
----- test no.1 -----
----- test no.2 -----
----- test no.3 -----
----- test no.4 -----
----- test no.5 -----
----- test no.6 -----
----- test no.7 -----
----- test no.8 -----
----- test no.9 -----
Basically, I didn't get any NaN error. Could you check which cam2pixel call gives the error and whether some coordinates contain a zero element?
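One way to run such a check, sketched with hypothetical helper names (zero_depth_indices and project_strict are not part of the repository): list the points whose depth is exactly zero, and turn NumPy's divide-by-zero warning into an exception so the traceback points at the offending call.

```python
import numpy as np

def zero_depth_indices(cam_coord):
    """Indices of points whose camera-space depth (z) is exactly zero."""
    return np.flatnonzero(np.asarray(cam_coord)[:, 2] == 0)

def project_strict(cam_coord, f, c):
    """Same arithmetic as cam2pixel, but a zero depth raises instead of warning."""
    cam_coord = np.asarray(cam_coord, dtype=np.float64)
    with np.errstate(divide="raise", invalid="raise"):
        x = cam_coord[:, 0] / cam_coord[:, 2] * f[0] + c[0]
        y = cam_coord[:, 1] / cam_coord[:, 2] * f[1] + c[1]
    return np.stack((x, y, cam_coord[:, 2]), 1)

pts = np.array([[10.0, 20.0, 4000.0], [-30.0, -40.0, 0.0]])
print(zero_depth_indices(pts))  # [1]

try:
    project_strict(pts, (1000.0, 1000.0), (500.0, 500.0))
    raised = False
except FloatingPointError:
    raised = True
print(raised)  # True
```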
from i2l-meshnet_release.
@mks0601 Hi!
Sure.
I've debugged the script only on s_06_act_14_subact_02_ca_04_000856.jpg, and the error goes like this:
in main/debug_h36m_nan.py: SubHuman36M.__getitem__(): line 201-206:
(I've modified this file, so the line numbers may not be correct)
# smpl coordinates
smpl_mesh_cam, smpl_joint_cam, smpl_pose, smpl_shape = self.get_smpl_coord(smpl_param, cam_param, do_flip, img_shape)
smpl_coord_cam = np.concatenate((smpl_mesh_cam, smpl_joint_cam))
focal, princpt = cam_param['focal'], cam_param['princpt']
smpl_coord_img = cam2pixel(smpl_coord_cam, focal, princpt)
on line 202, the returned smpl_mesh_cam contains a 0 value, and so smpl_coord_cam contains a 0 value.
on line 206, smpl_coord_cam is passed into the func cam2pixel as cam_coord, which contains a 0 value on the z-axis, so divide-by-zero occurs, and so smpl_coord_img contains -inf values:
you can see that the no.4794 vertex is the only vertex that contains 0 on the z-axis (smpl_mesh_cam and smpl_coord_cam), then after line 206, smpl_coord_img has -inf values on the x-axis and y-axis of the no.4794 vertex.
Then main/debug_h36m_nan.py: SubHuman36M.__getitem__(): line 208-215:
# affine transform x,y coordinates, root-relative depth
smpl_coord_img_xy1 = np.concatenate((smpl_coord_img[:, :2], np.ones_like(smpl_coord_img[:, :1])), 1)
smpl_coord_img[:, :2] = np.dot(img2bb_trans, smpl_coord_img_xy1.transpose(1, 0)).transpose(1, 0)[:, :2]
smpl_coord_img[:, 2] = smpl_coord_img[:, 2] - smpl_coord_cam[self.vertex_num + self.root_joint_idx][2]
# coordinates voxelize
smpl_coord_img[:, 0] = smpl_coord_img[:, 0] / cfg.input_img_shape[1] * cfg.output_hm_shape[2]
smpl_coord_img[:, 1] = smpl_coord_img[:, 1] / cfg.input_img_shape[0] * cfg.output_hm_shape[1]
smpl_coord_img[:, 2] = (smpl_coord_img[:, 2] / (cfg.bbox_3d_size * 1000 / 2) + 1) / 2. * \
    cfg.output_hm_shape[0]  # change cfg.bbox_3d_size from meter to millimeter
after line 209, smpl_coord_img_xy1 contains -inf values too.
after line 210, smpl_coord_img contains -inf on the x-axis and nan on the y-axis:
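One plausible explanation for getting -inf on one axis and nan on the other: an affine transform with a rotation component mixes x and y, and whenever the two coefficients in a row have opposite signs, the row sums +inf and -inf, which IEEE arithmetic defines as nan. The sketch below uses a made-up 2x3 matrix (scale 0.5, rotation 30 degrees), not the actual img2bb_trans, and writes out the same row-by-row arithmetic the np.dot call performs.

```python
import math

# Projected x and y of the degenerate vertex, both -inf as reported above.
x = float("-inf")
y = float("-inf")

# Hypothetical affine: scale 0.5, rotation 30 degrees, arbitrary translation.
s, theta = 0.5, math.radians(30.0)
trans = [
    [s * math.cos(theta), s * math.sin(theta), 10.0],
    [-s * math.sin(theta), s * math.cos(theta), 20.0],
]

# Row 0: both coefficients positive, so (-inf) + (-inf) stays -inf.
x_new = trans[0][0] * x + trans[0][1] * y + trans[0][2]
# Row 1: opposite signs, so (+inf) + (-inf) is undefined and yields nan.
y_new = trans[1][0] * x + trans[1][1] * y + trans[1][2]

print(x_new, y_new)  # -inf nan
```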
Then after line 226, smpl_mesh_img contains -inf and nan, which will be the final output (targets['fit_mesh_img']):
# split mesh and joint coordinates
smpl_mesh_img = smpl_coord_img[:self.vertex_num];
If I'm the only one who has this problem, then it might have something to do with my environment.
I'm using Python 3.7.7, torch==1.4.0, numpy==1.19.1.
Here's the detailed package list, as returned by pip list:
Package         Version
--------------- -------------------
certifi         2020.6.20
chumpy          0.69
cycler          0.10.0
Cython          0.29.21
decorator       4.4.2
freetype-py     2.2.0
future          0.18.2
imageio         2.9.0
kiwisolver      1.2.0
matplotlib      3.3.1
networkx        2.4
numpy           1.19.1
opencv-python   4.4.0.42
Pillow          7.2.0
pip             20.2.2
pycocotools     2.0.1
pyglet          1.5.7
PyOpenGL        3.1.0
pyparsing       2.4.7
pyrender        0.1.43
python-dateutil 2.8.1
scipy           1.5.2
setuptools      49.6.0.post20200814
six             1.15.0
torch           1.4.0
torchgeometry   0.1.2
torchvision     0.5.0
tqdm            4.48.2
transforms3d    0.3.1
trimesh         3.8.1
wheel           0.34.2
Here's the good news.
I modified cam2pixel() to avoid the divide-by-zero problem, and the training process is fine so far.
Here are the keypoint and mesh results after 2 epochs.
Could you please show me your python environment details? That will help. ^ ^
from i2l-meshnet_release.
Did you modify the get_smpl_coord function of Human36M.py? For example, by making the coordinates root-relative. Could you check yours against mine line by line? cam means camera-centered coordinates, and a 0 z-axis coordinate means zero distance from the camera along the z-axis, which is nonsense. Could you visualize smpl_coord_img on the image in Human36M.py using the vis_mesh function?
from i2l-meshnet_release.
Hmmmm...
I'm sure that I haven't changed anything in get_smpl_coord() of Human36M.py.
I failed to visualize smpl_coord_img on the input img; I still can't understand these coordinate transforms. I'll let you know if I make further progress. ^ ^
Besides, do you think it could be due to using the 3DMPPE version of the Human3.6M dataset?
from i2l-meshnet_release.
The data from 3DMPPE is exactly the same as that of I2L-MeshNet. I just added SMPL parameters. Ah, when did you download the H36M data? I changed the extrinsic camera parameters and the corresponding functions on Jun 8 this year. I think this can make the coordinates zero because the translation vector was changed. If you downloaded them before Jun 8, could you download the camera parameters again and check the error?
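The thread does not spell out how the parameters changed, so the following is only an illustration of why stale extrinsics paired with newer code can give wrong depths. It assumes two common conventions for the translation vector; the function names and the numbers are hypothetical.

```python
import numpy as np

def world2cam_add(world, R, t):
    """Convention A: X_cam = R @ X_world + t."""
    return world @ R.T + t

def world2cam_sub(world, R, T):
    """Convention B: X_cam = R @ (X_world - T), with T the camera position."""
    return (world - T) @ R.T

R = np.eye(3)                        # toy rotation
T = np.array([0.0, 0.0, -4000.0])    # camera position (convention B), in mm
t = -R @ T                           # the equivalent translation for convention A
world = np.array([[0.0, 0.0, 1000.0]])

consistent = world2cam_add(world, R, t)  # depth 5000, matches convention B
mixed = world2cam_add(world, R, T)       # old-style vector in new-style code
print(consistent[0, 2], mixed[0, 2])     # 5000.0 -3000.0
```

A mismatch of this kind moves the whole body along the camera axis, so vertices can end up at or behind the camera plane, which is consistent with a single vertex landing on z = 0.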
from i2l-meshnet_release.
@mks0601 Hi!
Wow, that makes sense! I downloaded the H36M data at least half a year ago.
I'll re-download the annotations and check whether that solves the problem. ^ ^
from i2l-meshnet_release.
Awesome!
from i2l-meshnet_release.
Problem solved!
Thank you! ^ ^
from i2l-meshnet_release.