
Comments (21)

lzj322 commented on May 26, 2024

@GBJim, @Cysu I guess that PyTorch's DataParallel doesn't work well with nvidia-docker. Or maybe it is caused by PyTorch itself (see the PyTorch forum).

from open-reid.

Cysu commented on May 26, 2024

I wonder if it is fine to run the official mnist example?

GBJim commented on May 26, 2024

Hi @Cysu
I ran the MNIST example and no errors occurred.

I also tried training the Inception net from the examples: python examples/inception.py -d viper -b 64 -j 2 --loss xentropy --logs-dir logs/inception-viper-xentropy
No errors occurred there either.

The interesting thing is that when I tried to train ResNet again, the training process froze after the following output, without reporting any error:

Files already downloaded and verified
VIPeR dataset loaded
  subset   | # ids | # images
  ---------------------------
  train    |   216 |      432
  val      |   100 |      200
  trainval |   316 |      632
  query    |   316 |      632
  gallery  |   316 |      632
Epoch: [0][1/7] Time 160.275 (160.275) Data 0.446 (0.446) Loss 5.375 (5.375) Prec 0.00% (0.00%)
Epoch: [0][2/7] Time 0.563 (80.419) Data 0.001 (0.223) Loss 10.057 (7.716) Prec 0.00% (0.00%)
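
For readers interpreting the two log lines above: each field appears to show the latest value followed by a running average in parentheses, so the first iteration alone took about 160 seconds. The sketch below reproduces that arithmetic with a small meter class. The class is an illustration written for this note, not necessarily the one the training script uses.

```python
class AverageMeter:
    """Track the latest value and the running mean of a metric."""

    def __init__(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0

    def update(self, value):
        self.val = value
        self.sum += value
        self.count += 1

    @property
    def avg(self):
        # Mean of all values seen so far (0.0 before the first update).
        return self.sum / self.count if self.count else 0.0


batch_time = AverageMeter()
for seconds in (160.275, 0.563):  # per-iteration times copied from the log
    batch_time.update(seconds)

# prints "Time 0.563 (80.419)", matching the second log line
print(f"Time {batch_time.val:.3f} ({batch_time.avg:.3f})")
```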

Is this caused by the GPU resource usage?
Currently, some Caffe process is also using my GPUs.

Cysu commented on May 26, 2024

I'm not sure if it is caused by some deadlocks between pytorch and caffe, especially when both are using NCCL. You may try to run it again when the caffe experiments are finished.
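
To check whether another program is currently holding the GPUs before rerunning, something like the following sketch could help. It assumes `nvidia-smi` is on the PATH and uses its `--query-compute-apps` option; the helper names are made up for this example. Inside a container the list may come back empty even when processes on the host are using the device, so an empty result is not proof that the GPU is free.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into dicts."""
    apps = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        pid, name, memory = parts
        apps.append({"pid": pid, "name": name, "memory": memory})
    return apps


def list_gpu_processes():
    """Return compute processes reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


for app in list_gpu_processes():
    print(app["pid"], app["name"], app["memory"])
```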

GBJim commented on May 26, 2024

Hi @Cysu
Sorry for the late response.
I tried it again after my Caffe process had terminated.

Training freezes when the -j (workers) argument is set to a value greater than 1.
If -j is set to 1, I get the error: [Errno 111] Connection refused
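
A rough way to compare several -j values without watching each run by hand is to launch the command under a timeout and record whether it exited. This is only a sketch: the function names are invented here, and a timeout cannot tell a frozen process from a slow one, so the limit should be longer than a healthy run is expected to take. Worker processes started by the training script may also outlive the killed parent.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return 'ok', 'failed', or 'frozen' (no exit within timeout_s)."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return "frozen"
    return "ok" if result.returncode == 0 else "failed"


def sweep_workers(base_cmd, worker_counts, timeout_s):
    """Append '-j N' to base_cmd for each N and record the outcome."""
    return {n: run_with_timeout(base_cmd + ["-j", str(n)], timeout_s)
            for n in worker_counts}


# Example call (not executed here); the script path and the one-hour limit
# are placeholders to adapt:
# sweep_workers(["python", "examples/inception.py", "-d", "viper"], (1, 2, 4), 3600)
```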

Cysu commented on May 26, 2024

@GBJim Could you please change the num_workers in the official mnist example and see if it has the same problem?

GBJim commented on May 26, 2024

@Cysu:

I tested the MNIST example with 16 workers. Everything works correctly.

Cysu commented on May 26, 2024

Sorry, but currently I have no idea why this happens. There should not be much difference between our data loader and the MNIST one. I'm not sure if it is related to running as root instead of a normal user on Linux.

GBJim commented on May 26, 2024

Thanks @Cysu
I will try to figure it out!

Cysu commented on May 26, 2024

@GBJim any luck on this?

GBJim commented on May 26, 2024

Hi @Cysu
I've built a new environment for Open-ReID and cloned the latest commit.
But it seems that resnet.py and inception.py have been removed from the examples folder.

Is there a new tutorial on how to run training or testing?
Thanks!

GBJim commented on May 26, 2024

It seems the code has been reorganized into oim_loss.py, softmax_loss.py, and triplet_loss.py.
Let me check whether these scripts work.

GBJim commented on May 26, 2024

@Cysu

I tried these commands: python examples/oim_loss.py -d viper, python examples/softmax_loss.py -d viper, and python examples/triplet_loss.py -d viper.
The following output was printed and then the process froze. I had to press Ctrl+Z to get out of it.

root@e50f76502ce4:~/open-reid# python examples/oim_loss.py -d viper
Files already downloaded and verified
VIPeR dataset loaded
  subset   | # ids | # images
  ---------------------------
  train    |   216 |      432
  val      |   100 |      200
  trainval |   316 |      632
  query    |   316 |      632
  gallery  |   316 |      632
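
When a run stops after output like the above without any error, one option is to have Python dump its stack traces after a delay, which at least shows which Python call the process is stuck in. The sketch below uses the standard-library `faulthandler` module; the wrapper name, log path, and delay are assumptions for illustration. If the hang is inside native code (for example CUDA or NCCL), the dump will only show the Python frame that made that call.

```python
import faulthandler


def run_with_stack_dump(fn, log_path, dump_after_s=120.0):
    """Call fn(); while it runs, write the stack of every thread to log_path
    each time dump_after_s seconds pass. The call itself is not interrupted."""
    with open(log_path, "w") as log:
        faulthandler.dump_traceback_later(dump_after_s, repeat=True, file=log)
        try:
            return fn()
        finally:
            # Stop the watchdog before the log file is closed.
            faulthandler.cancel_dump_traceback_later()
```

A training entry point could be wrapped as `run_with_stack_dump(main, "hang.log")`, where `main` stands for whatever function the example script calls.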

Cysu commented on May 26, 2024

@GBJim Oh, I forgot to update the tutorials. Just finished. Please check here.

Does the previous error still occur when -j 1 is used?

GBJim commented on May 26, 2024

@Cysu

The process still freezes when I set a single job. (Maybe I should wait longer.)

I set -j to 1 and tried the following combinations:

OIM + ResNet --> Frozen

OIM + Inception --> RuntimeError: The expanded size of the tensor (128) must match the existing size (64) at non-singleton dimension 1. at /root/pytorch/torch/lib/THC/generic/THCTensor.c:323

SOFTMAX + ResNet --> Frozen

SOFTMAX + Inception --> Works Normally
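
About the RuntimeError in the OIM + Inception case: the message is the kind raised when a tensor is expanded to a shape whose size differs in a dimension where the tensor's own size is not 1. Here one side has 128 entries in dimension 1 and the other has 64; the thread does not say which tensor is which. The plain-Python sketch below mimics that rule for shapes aligned from the right. The function name and the batch size of 32 are made up for the example, and special cases such as -1 in the target shape are ignored.

```python
def first_expand_conflict(existing, target):
    """Return the index (in `target`) of the first dimension that blocks
    expanding shape `existing` to shape `target`, or None if it is valid."""
    if len(existing) > len(target):
        raise ValueError("target must have at least as many dimensions as existing")
    offset = len(target) - len(existing)  # shapes are aligned from the right
    for i, size in enumerate(existing):
        # A dimension can be expanded only if it has size 1 or already matches.
        if size != 1 and size != target[offset + i]:
            return offset + i
    return None


print(first_expand_conflict((32, 64), (32, 128)))  # prints 1: 64 cannot become 128
print(first_expand_conflict((1, 64), (32, 64)))    # prints None: size-1 dim expands
```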

And thank you for updating the documentation!

Cysu commented on May 26, 2024

That's weird... What's the script for OIM + Inception?

GBJim commented on May 26, 2024

@Cysu
python examples/oim_loss.py -d viper -a inception -j 1

lzj322 commented on May 26, 2024

I have the same issue. The problems that @GBJim had happen to me as well. In particular, inception.py runs without problems, but resnet.py freezes.

GBJim commented on May 26, 2024

@lzj322 Do you use Nvidia-docker to host the environment?

lzj322 commented on May 26, 2024

@GBJim Yes. Would that be a problem? I don't know much about it. I asked the administrator to reset Docker. Now it gives normal results, but we don't know why.
I am afraid that this issue could happen again someday.

Cysu commented on May 26, 2024

@lzj322 Yeah, two programs cannot run on the same device if using NCCL.
