Comments (4)

ikki407 commented on August 27, 2024

Hi @adypd97, thank you for your interest in HandyRL!

First, after the training server has launched, start the workers on each worker VM with python main.py --worker. Before doing so, write the server address in the worker config (i.e. worker_args). This command connects the workers to the server. Once the server detects a worker connection, the learning process starts.
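As a rough illustration of the worker_args step, the sketch below checks a config dict for common mistakes in server_address before a worker is launched. It assumes the config has already been loaded into a Python dict with a worker_args section containing server_address, in the same shape as the dict main.py prints at startup. The function name worker_address_problems and the address 10.0.0.2 are hypothetical and not part of HandyRL.

```python
def worker_address_problems(config):
    """Return a list of readable problems found in worker_args['server_address'].

    Hypothetical helper for illustration; HandyRL does not ship this function.
    """
    problems = []
    address = config.get("worker_args", {}).get("server_address")

    # A missing, non-string, or blank address cannot be connected to.
    if not isinstance(address, str) or not address.strip():
        problems.append("server_address is missing or empty")
        return problems

    address = address.strip()

    # Angle brackets usually mean a template value was never filled in.
    if address.startswith("<") and address.endswith(">"):
        problems.append("server_address still looks like a placeholder")

    # Loopback only reaches a server running on the same machine as the worker.
    if address in ("localhost", "127.0.0.1", "::1"):
        problems.append("server_address is a loopback address")

    return problems


# Example with a made-up internal address: no problems are reported.
example_config = {"worker_args": {"server_address": "10.0.0.2", "num_parallel": 32}}
print(worker_address_problems(example_config))
```

An empty list only means the value looks plausible; it does not prove the server is reachable from the worker VM.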

We previously illustrated an overview of the distributed architecture for the Google Research Football competition. I hope this helps you.

Thanks

from handyrl.

adypd97 commented on August 27, 2024

Hi @ikki407!

Thanks for the link to the documentation! Very helpful!

To the main issue: yes, I ran two worker VMs following the steps you mention, and I entered the public IP of the server VM (the learner) in the worker_args parameter on both workers. After that I got the OUTPUT mentioned in my initial comment. It seems the learner is not able to detect the workers.

As further evidence, I added a simple print statement to the run function in ./handyrl/train.py (starting at line 404):

    def run(self):
        print('waiting training')
        while not self.shutdown_flag:
            # Wait until enough episodes have been collected.
            if len(self.episodes) < self.args['minimum_episodes']:
                print('here')  # debug print added for this test
                time.sleep(1)
                continue
            if self.steps == 0:
                self.batcher.run()
                print('started training')
            model = self.train()
            self.report_update(model, self.steps)
        print('finished training')

And in the output I get the following:
OUTPUT:

xyz@vm1:~/HandyRL$ python3 main.py --train-server
{'env_args': {'env': 'HungryGeese'}, 'train_args': {'turn_based_training': False, 'observation': False, 'gamma': 0.8, 'forward_steps': 32, 'compress_steps': 4, 'entropy_regularization': 0.002, 'entropy_regularization_decay': 0.3, 'update_episodes': 500, 'batch_size': 400, 'minimum_episodes': 1000, 'maximum_episodes': 200000, 'epochs': -1, 'num_batchers': 7, 'eval_rate': 0.1, 'worker': {'num_parallel': 32}, 'lambda': 0.7, 'max_self_play_epoch': 1000, 'policy_target': 'TD', 'value_target': 'TD', 'eval': {'opponent': ['modelbase'], 'weights_path': 'None'}, 'seed': 0, 'restart_epoch': 0}, 'worker_args': {'server_address': '<EXTERNAL_IP_OF_SERVER_GOES_HERE_FOR_WORKERS>', 'num_parallel': 32}}
Loading environment football failed: No module named 'gfootball'
started batcher 0
started batcher 1
started batcher 2
started batcher 3
started batcher 4
started batcher 5
waiting training
started entry server 9999
started batcher 6
started worker server 9998
started server
here
here
here...

I hope you find this helpful in assisting me. In any case thanks once again!

ikki407 commented on August 27, 2024

From your output, it seems that the server and the workers are not connecting.

Next steps to debug:

  • Check whether it runs correctly on your local machine (localhost).
  • If your instances are on the same GCP network, you can use the internal IP.
  • Use a small config for debugging (batch size = 1, minimum episodes = 10, update episodes = 1, ...).
  • Check the GCP network/firewall settings (does the ping command succeed? is TCP allowed?).
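Two of the checks above can be sketched in Python. This is a minimal sketch under stated assumptions, not part of HandyRL: the ports 9999 and 9998 are taken from the server log earlier in this thread ("started entry server 9999", "started worker server 9998"), the keys batch_size, minimum_episodes and update_episodes are the train_args names shown in that log, and the helper names can_connect, check_server and with_debug_overrides are hypothetical.

```python
import copy
import socket

# Ports as printed in the server log in this thread; adjust if yours differ.
HANDYRL_PORTS = (9999, 9998)

# Small values suggested for debugging, keyed by the train_args names in the log.
DEBUG_TRAIN_OVERRIDES = {
    "batch_size": 1,
    "minimum_episodes": 10,
    "update_episodes": 1,
}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_server(host, ports=HANDYRL_PORTS):
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: can_connect(host, port) for port in ports}


def with_debug_overrides(config):
    """Return a copy of config with the small debug values in train_args."""
    debug_config = copy.deepcopy(config)
    debug_config.setdefault("train_args", {}).update(DEBUG_TRAIN_OVERRIDES)
    return debug_config
```

Calling check_server with the server's address from a worker VM reports, per port, whether a TCP connection could be opened. If both ports come back False while the server is running, a firewall rule or a wrong address is a more likely cause than HandyRL itself.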

ikki407 commented on August 27, 2024

What does the worker process/VM look like? If the workers are still running without any errors, there may be a problem I haven't seen before.
