thias15 commented on May 22, 2024

Hi John.

Thank you very much for your detailed issue, I really appreciate it! This makes it much easier to help. First the good news: your procedure is correct. Now let me clarify a few things.

  1. Cmd: This corresponds to a high-level command such as "turn left/right" or "go straight" at the next intersection. It is encoded as -1: left, 0: straight, 1: right. As you pointed out, this command can be controlled with the X, Y, or B buttons on the game controller. If you have LEDs connected, it will also control the left/right indicator signals of the car. These commands are logged in the indicatorLog.txt file. During training, the network is conditioned on these commands. If you approach an intersection where the car could go left, straight, or right, it is not clear what it should do based on the image alone. This is where these commands come in, to resolve the ambiguity. It seems that you just want the car to drive along a path in your house. In this case, I would recommend just keeping this cmd at 0. NOTE: This command should be the same when you test the policy on the car.

  2. Label, Pred: These are the control signals of the car, mapped from [-255, 255] to [-1, 1]. The label is obtained by logging the controls that were used to drive the car. The prediction is what the network predicts to be the correct value given an image.

  3. Clipping for image display: this is due to the data augmentation which results in some image values outside the valid range. You can just ignore this.

Now a few comments that will hopefully help you to get it to work.

  1. The same motor value of 0.23 for every frame is a problem. This should not happen. Please try deleting the generated files in the sensor_data folder ("matched_..."). When you run the Jupyter notebook again, they will be regenerated.
  2. In general, the label values seem very low. We used the "Fast" mode for data collection; I would recommend doing the same. Note that in lines 43-45 of dataloader.py, values are normalized into the range [-1, 1] (see the sketch after this list):
    def get_label(self, file_path):
        index = self.index_table.lookup(file_path)
        return self.cmd_values[index], self.label_values[index]/255

For the "Normal" mode, the maximum is capped at 192. For the "Slow" mode at 128.

  3. Depending on the difficulty of the task, you may have to collect significantly more data. Could you describe your data collection process and the driving task in a bit more detail? Also, you may need to train for more epochs.
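To make the scaling concrete, here is a minimal sketch of the normalization described in point 2 (the function name is mine, not from the repo):

    # The dataloader always divides by 255; the speed mode only caps how
    # large the raw values can get (Fast: 255, Normal: 192, Slow: 128), so
    # data collected in slower modes yields labels that never reach +/-1.
    def normalize_label(raw_left, raw_right):
        return raw_left / 255, raw_right / 255

    print(normalize_label(192, 140))  # Normal-mode maximum -> (0.753, 0.549)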

Hope this helps. Please keep me updated.

chilipeppr commented on May 22, 2024

Ok, here's a video of how I train. I used Fast mode (vs. Normal or Slow). I set the model to AUTOPILOT_F and used NNAPI.

https://photos.app.goo.gl/o6BtAHunDjtj8fMNA

And then here's a video of playing back that training. It still doesn't quite work, but I do seem to be getting more movement in the robot with training in Fast mode vs Normal.

https://photos.app.goo.gl/kCw4DpRN6vPpbtCcA

parixit commented on May 22, 2024

@chilipeppr super helpful video! It would be great if you could do a step-by-step video of your build for complete newbies.

parixit commented on May 22, 2024

Agreed! This project is daunting, but I want to do it together with my kids. I'm waiting on the parts, and I had our local library 3D print the parts (even they were interested in the project). I'll look forward to your videos, keep us posted!

chilipeppr commented on May 22, 2024

Is it possible that with my kitchen island I have to train each turn around the island as a right turn? Meaning, set Cmd = 0 on the straight parts and then Cmd = 1 as I make each of the 4 right turns?

thias15 commented on May 22, 2024

@chilipeppr If you would like to contribute build videos, that would be awesome, and we would be very happy to include them for others in the README! I realize that a lot of people require much more detailed instructions. We are working to provide more comprehensive documentation, but at the moment I have a lot of other engagements as well. For the time-lapse video, I did record a complete build, but did not get a chance to edit it yet. If you like, I'd be happy to set up a quick call with you to coordinate.

thias15 commented on May 22, 2024

The predicted control values still seem to be too low. Could you post the figures at the end of training? I'm afraid the model either did not converge properly or overfit. The training and validation loss should both decrease, and the direction and angle metrics should both increase.

The task of your choice should be learnable, and keeping the indicator command at 0 should be fine since you are driving along a fixed path. However, I suspect that you need to train the model for more epochs and that you need more training data. I would recommend the following:

  1. Collect maybe 10 datasets with 5 loops each, driving around the kitchen block. Starting/stopping logging should ideally be done while driving along the trajectory. In the video you recorded, you drive somewhere else at the end before the logging is stopped. This could lead to difficulty during training, especially if there is not a lot of data.
  2. Take 8 of these datasets for training and the remaining two for validation. By monitoring the validation metrics, you should get a good idea of when the model is starting to work.

Collecting good/clean data is key to machine learning. I know it is not a lot of fun to collect such data, but it is what makes it work in the end! Keep up the great work. Looking forward to your next update (hopefully with the robot driving autonomously).

chilipeppr commented on May 22, 2024

Ok, I retrained with 10 datasets -- 8 for training and 2 for testing. Each run was 5 to 7 loops around the kitchen island. I turned the noise on for 3 of the dataset runs as well.

Here's a video of how I did the training. It's similar to my first post, but I started logging while in motion. I kept Cmd=0 (the default).
https://www.youtube.com/watch?v=W7EHo0Jk02A

On the phone, these are the zip files that I copied and extracted to the train_data and test_data folders. Notice they're all around 40MB to 80MB, which feels right for the size of a training session. Again, I used crop_img.
image

Here are the 8 training datasets placed into the policy/dataset folder.
image

Here are the 2 test datasets.
image

I also ran it at Normal speed, but changed the divider in dataloader.py to 192 from the default 255, since the default assumes Fast mode.
image
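Presumably the modified function then looked something like this (my reconstruction of the described one-line change to get_label, not a confirmed diff):

    def get_label(self, file_path):
        index = self.index_table.lookup(file_path)
        # divide by 192 instead of the default 255, since this data was
        # collected in "Normal" mode, where motor values are capped at 192
        return self.cmd_values[index], self.label_values[index]/192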

I also started/stopped logging by hitting the A button on the Xbox controller while the robot was in motion, so I would log no speeds of 0. You can see that for the 10 datasets I had almost no frames removed for speed 0. I'm surprised I ended up with any speed-0 frames at all, because I don't recall stopping, so that's a bit of a concern.

image

I ended up with the most frames I've ever trained with.

image

I ended up with much higher numbers in the Label column here than the 0.23 values you were worried about in my original post.

image

Here is the model.fit output. I'd love to understand what the loss, direction_metric, and angle_metric mean, to know whether this output seems reasonable or not.

image
image

Here is the Evaluation data.

image
image

I'm a little worried about these warnings, but maybe they're ok to ignore.
image

And then here's the final output with predictions. The motor values in the predictions sure seem better.

image

However, when I run the Autopilot with this new model, it still seems to fail. The only progress is that I now have motor movement; before, the motor values were so low I had no movement. Here's a video of the autopilot running and the robot not staying on the path, but rather just running into chairs.

https://www.youtube.com/watch?v=a0-0lh7_j0E

chilipeppr commented on May 22, 2024

Hmm. Do I need to edit any of these variables? My crop_img images are 256x96.

image
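One quick way to sanity-check the crop size before training would be something like this (my own helper, not part of the repo; the glob pattern is illustrative):

    from PIL import Image
    import glob

    # Verify that every training image has the expected 256x96 crop size.
    for path in glob.glob("dataset/train_data/**/*.jpeg", recursive=True):
        size = Image.open(path).size
        if size != (256, 96):
            print(f"unexpected size {size}: {path}")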

chilipeppr commented on May 22, 2024

Well, apparently the crop_imgs must be correct, because I tried doing a training session with preview_img and got these errors when I went to train.

image

thias15 commented on May 22, 2024
  1. crop_imgs is correct.
  2. Can you try changing the batch size to a value as high as possible? We trained with a batch size of 128, but this will most likely require a GPU. If you cannot, you will need to decrease the learning rate; from the plots, it looks like it is too high for the dataset.
  3. I'm not sure if rescaling all the values by 192 would work, since the car was never actually driven with the resulting values. Did you mix the "Normal" and "Fast" datasets?
  4. In line 29, the fact that the label for all images is the same (0.88, 0.73) is definitely problematic as well. (The reversed label is generated by FLIP_AUG; see the sketch after this list.) For your task of going around the kitchen block in one direction, you should probably set FLIP_AUG to False!
  5. If you like you can upload your dataset somewhere and I'll have a look. This would be much quicker to debug.
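For intuition on point 4: flip augmentation presumably mirrors each image and swaps the left/right control labels, roughly like this sketch (my simplification, not the repo's exact code):

    import tensorflow as tf

    # Rough sketch of flip augmentation: mirror the image left-right and
    # swap the (left, right) control labels so the flipped sample stays
    # consistent. For a one-directional task (always turning right), this
    # effectively injects "turn left" samples -- hence FLIP_AUG = False.
    def flip_sample(image, label):
        image = tf.image.flip_left_right(image)
        label = tf.stack([label[1], label[0]])
        return image, label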

chilipeppr commented on May 22, 2024

Here is a link to download my dataset. It's the 10 sessions I ran yesterday based on your initial feedback. 8 of the sessions are in train_data as a Zip file. 2 of the sessions are in test_data as a Zip file.

https://drive.google.com/drive/folders/18MchBUtods4sRerSpaA6eTrtC9DPvpbd?usp=sharing

I just tried training the dataset again with your feedback above:

  1. I changed the batch size to 128. I have an Nvidia GeForce GTX on my Surface Book 3, so the GPU needed for that change is no problem.
  2. All of my training was done at the Normal speed, so the 192 divider should be ok. There is no Fast data in this dataset.
  3. I turned off FLIP_AUG.

image

The results still didn't do anything for me. The robot still acts the same way. I did train for 20 epochs this time.

image
image

The "best fit" was epoch 2 so that was a lot of wasted CPU/GPU going to 20 epochs.

image

thias15 commented on May 22, 2024

I will download the data and investigate. The fact that it reaches perfect validation metrics after two epochs and then completely fails is very strange. Did you also try deploying the last.tflite model, or running it on some test images to see if the predictions make sense?

thias15 commented on May 22, 2024

When I visualize your data, I see variation in the labels, as expected. Do you still see all the same labels?
Screenshot 2020-09-15 at 19 29 46

chilipeppr commented on May 22, 2024

Yeah, in my training my labels are still all the same. So this does seem messed up.

image

chilipeppr commented on May 22, 2024

On your question "Did you also try to deploy the last.tflite model" I did and it was the same failure. It just kept showing a motor value around 0.75 on both left and right motors, sometimes jumping to 0.8 and it would just drive right into chairs/walls.

thias15 commented on May 22, 2024

This is definitely a problem. In the best case, the network will just learn this constant label. Did you make any changes to the code? I'm using the exact code from the Github repo with no changes (except FLIP_AUG = False in cell 21). In case you made changes, could you stash them or clone a fresh copy of the repo? Then put the same data you uploaded into the corresponding folders and see if you can reproduce what I showed in the comment above.

chilipeppr commented on May 22, 2024

I haven't changed any of the code. I did try that last run with the batch size changed and FLIP_AUG = False. I also tried epochs = 20. I did change dataloader.py to divide by 192. Other than that, the code is totally the same. I can try re-checking out the repo, but I don't think that's going to change much.

One thing I'm trying right now is to create a new conda environment with tensorflow instead of tensorflow-gpu as the library.

chilipeppr commented on May 22, 2024

Why do I get clipping errors and you don't for utils.show_train_batch?

thias15 commented on May 22, 2024

I also get the clipping errors; I just scrolled down so more images with labels are visible. I'm currently running tensorflow on the CPU of my laptop without a GPU. It will take some time, but it should not make any difference. For the paper, all experiments were performed on a workstation with a GPU. One difference is that I only used Mac and Linux. Maybe there is a problem with the way the labels are looked up on Windows? From the screenshots it seems you're on Windows.

thias15 commented on May 22, 2024

One thing you could try is running everything in the Windows Subsystem for Linux.

chilipeppr commented on May 22, 2024

Yes, I'm on Windows. Surface Book 3 with Nvidia GPU.

thias15 commented on May 22, 2024

I'll update you in about 30-60 minutes regarding training progress. But it seems that your issue is the label mapping, and I suspect at this point it is related to Windows. As I mentioned, you could try running the code in the Windows Subsystem for Linux. I will also see if I can run it in a VM or set up a Windows environment for testing.

chilipeppr commented on May 22, 2024

I'm wondering, if you get a final best.tflite file out of your run, could you send it to me to try out on the robot?

I hear you on the label mapping. Could this possibly be something as dumb as Windows doing CR/LF and Mac/Linux using LF?

thias15 commented on May 22, 2024

Hello. It finished training for 10 epochs now. The plots look reasonable, so why don't you give it a try. To achieve good performance, some hyperparameter tuning, more data, and more training time are usually needed. But let's see.
best.tflite.zip

thias15 commented on May 22, 2024

notebook.html.zip
This is the complete output of my Jupyter notebook, to give you some idea of how the output should look. When I get a chance, I will explore the issue you encounter in a Windows environment. It could be something like CR/LF vs LF, but since the code relies on the os library, these types of things should be taken care of. I don't know, but I will let you know what I discover. Thanks for your patience. I really want you to be able to train your own models and will try my best to figure out the issue you are encountering.

thias15 commented on May 22, 2024

Note that both files need to be unpacked. I had to zip them in order to upload them here.

chilipeppr commented on May 22, 2024

I just tried running your best.tflite and it does not work any better. The robot still runs into walls.

thias15 commented on May 22, 2024

Does it have a tendency to turn to the right, as expected from the test images?

chilipeppr commented on May 22, 2024

Yes, it does seem to tend slightly to the right as it's driving.

thias15 commented on May 22, 2024

The predicted values are not quite large enough. More data, training time, and some hyperparameter tuning should greatly improve things. We also have not trained our models at the "Normal" speed before, so I'm not sure whether this could also have some effect.

thias15 commented on May 22, 2024

Can you check line 11 of your notebook? How do you define the base_dir?

chilipeppr commented on May 22, 2024

What if I try the Cmd = -1, 0, 1 stuff, hitting X/Y/B at the appropriate time? I suppose I'd just be hitting B to turn right and Y to go straight, since I only take right turns. Make some training data that way and see how it goes?

Recall that I did make a few datasets earlier with Fast mode and still didn't see good results.

thias15 commented on May 22, 2024

@chilipeppr since your label values are messed up, I expect that none of your previous training runs produced models that actually learnt something useful. Adding these cmd values will likely not help, and it would also require you to apply those cmds at test time.

chilipeppr commented on May 22, 2024

base_dir = "./dataset"

thias15 commented on May 22, 2024

In Unix environments the current directory is written as ./
However, in Windows it is written as .\

The current directory is denoted by the "." in both cases, but the slashes used to define subsequent directories are different.

thias15 commented on May 22, 2024

Can you try to change it, run it again and see if you get reasonable labels?

chilipeppr commented on May 22, 2024

Ok, trying right now...

chilipeppr commented on May 22, 2024

Nope. Same problem. Here's my line 11:
image

image

chilipeppr commented on May 22, 2024

Are you sure my train_ds isn't just getting indexed such that the left/right pairs are next to each other in memory, and your next() and iter() are just returning things in order? Maybe on other computers the index is in a different order?

thias15 commented on May 22, 2024

I believe the issue is in the data loader. Basically, I build a dictionary that uses the frame paths as keys to the labels. This is not really the best way of doing it, but it worked fine. I suspect that using the path as the key leads to a problem on Windows.
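The failure mode would then look roughly like this (a plain-dict simplification; the real loader uses a TensorFlow lookup table, and the paths and values here are illustrative):

    # If keys are stored with "/" but looked up with Windows "\" paths,
    # every lookup misses and falls back to a default entry -- consistent
    # with every image showing the same label.
    labels = {"dataset/train_data/run1/images/frame_001.jpeg": (0.88, 0.73)}

    key_unix = "dataset/train_data/run1/images/frame_001.jpeg"
    key_win = "dataset\\train_data\\run1\\images\\frame_001.jpeg"

    print(labels.get(key_unix, "default"))  # hit: (0.88, 0.73)
    print(labels.get(key_win, "default"))   # miss: "default"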

thias15 commented on May 22, 2024

It's already late here. I'll see if I can get a Windows setup tomorrow to figure out the issue.

chilipeppr commented on May 22, 2024

Ok, looks like you were right. The left/right data is identical in each line that you read. See this debug output; I just threw a debug statement in at the end of that loop.
image

image

chilipeppr commented on May 22, 2024

Wait, no, that's not true. The data is fine. The way I drive to collect data is in Video Game mode, where I use RT to drive straight without touching any other joysticks. Then, when I go to turn, I keep RT held while I nudge the joystick. So much of the data has matching left/right values, but not all of it. Perhaps this is really just a coincidence; as you can see, when you move further into the data the left/right values are different.

image

thias15 commented on May 22, 2024

Yes, but the samples for visualization are drawn randomly. It is very unlikely that all labels are the same, and this does not happen in my case. There seems to be something wrong with the dictionary. From the debug output, could it be that base_dir needs to be .\\dataset, or simply dataset if that works in Windows?

thias15 commented on May 22, 2024

One more idea. Can you replace line 24 in dataloader.py with this line and see if that fixes the labels (note that the backslash has to be escaped):

lines = data.replace("\\","/").replace(","," ").replace("\t"," ").split("\n")

chilipeppr commented on May 22, 2024

Hmm. I did that, and now I have this output, which looks like it matches Mac/Linux, but my images still all have the same labels.

Here's line 24. I also added a "\r" to be replaced with nothing, just in case that was messing things up. If there isn't a "\r", it gracefully moves on.
image

Here's debug output.
image

Sadly, the images still have the same labels.

image

thias15 commented on May 22, 2024

Hello. I'm working on the Windows setup. In the meantime, I have trained two models on my workstation with GPU for 100 epochs. Can you try whether these models work better?
bz16_lr1e-4.zip
bz128_lr3e-4.zip

chilipeppr commented on May 22, 2024

Ok, that's really good progress. I just tested the 1st zip file, and the driving is much better. It's still bumping into walls, but it appears much more intelligent; it starts to turn much more. I'll try the other zip file and then record some video.

100 epochs is a lot of processing, so I will need the GPU. I tried getting VirtualBox going last night, and alternatively the Windows Subsystem for Linux, but neither worked well. There is no GPU in VirtualBox, and WSL seems not to run at all because Python reports the file separator to the TensorFlow library as \ even though it's a Linux environment and should be /, so the Jupyter notebook fails to run.

chilipeppr commented on May 22, 2024

https://www.youtube.com/watch?v=27EiBkpkbtU

This is the bz16 file running on the robot. It works much better than the bz128, which I did not record video for, since it wasn't really doing anything more than all the other tflite files that have been created.

In this video I just keep turning on the network, and right before it runs into a wall I turn it back to video game mode. I manually back it up and then let the autopilot run again. It's actually starting to sort of make it around curves.

It seems like I should do training where the robot is staring more at the walls/objects on the left/right, so it gets trained to turn more aggressively back to the center line of the path.

thias15 commented on May 22, 2024

Ok, it seems we are on a good track to getting it to work. One key factor for robust task performance is noise injection. This is explained in more detail in the paper. It basically helps to explore the state space (kind of what you mentioned in your last sentence). On the PlayStation controller, you can activate it with the options button; when you press it, the phone will say "noise enabled". Now, when you drive the robot, random noise will be added to the controls periodically, but only your control commands will be recorded. This will result in the robot ending up in bad states (e.g. facing a wall or obstacle, or deviating from the best path), but only the controls to recover from those states will be recorded (the controls applied by the operator to compensate for the noise). This is very important for robust performance. In our experiments, we collected about 50% of the data with noise. Could you collect this type of data?
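Conceptually, the noise injection works something like this sketch (my own simplification in Python, not the app's actual code; the noise magnitude is assumed):

    import random

    def set_motors(left, right):
        print(f"motors: {left:.2f}, {right:.2f}")   # stand-in for the actuator call

    def log_label(left, right):
        print(f"logged: {left:.2f}, {right:.2f}")   # stand-in for the logger

    # Noise perturbs what the motors execute, but only the operator's
    # commands are logged, so the dataset captures recovery behaviour.
    def drive_step(op_left, op_right, noise_enabled):
        log_label(op_left, op_right)                # label = operator controls only
        if noise_enabled:
            op_left += random.uniform(-0.3, 0.3)    # assumed noise magnitude
            op_right += random.uniform(-0.3, 0.3)
        set_motors(op_left, op_right)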

thias15 commented on May 22, 2024

So I set up a Windows machine and figured out a solution. Keep your changes to the dataloader that match the Mac/Linux format. Then change one line in each of the following two tf functions in the Jupyter notebook (cells 29 and 30) for processing the file paths.

  1. def process_train_path(file_path)
    Change the first line
    from cmd, label = train_data.get_label(file_path)
    to cmd, label = train_data.get_label(tf.strings.regex_replace(file_path,"[/\\\\]","/"))

  2. def process_test_path(file_path)
    Change the first line
    from cmd, label = test_data.get_label(file_path)
    to cmd, label = test_data.get_label(tf.strings.regex_replace(file_path,"[/\\\\]","/"))

It's like 4am here, so I don't trust myself to push something to the repo now. :) I will make sure everything works cross-platform tomorrow and then push it. But since you are probably still up, I thought I'd give you a heads-up. Let me know if this works for you.
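Put together, the patched function would presumably look like this (a sketch based on the one-line change above; it assumes the notebook's train_data object, and the rest of the function body is elided):

    import tensorflow as tf

    def process_train_path(file_path):
        # map both "/" and "\" separators to "/" so the lookup key matches
        # the Mac/Linux-style keys stored by the dataloader
        cmd, label = train_data.get_label(
            tf.strings.regex_replace(file_path, "[/\\\\]", "/"))
        ...  # remainder of the function unchanged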

thias15 commented on May 22, 2024

bz16_lr1e-4.zip
This policy was trained on the Windows setup (a bit longer than before).

The values are still slightly lower than the labels, but I think it's worth a try. From here on, some tweaking of the parameters and more data will probably help to get a better policy. If things still don't work, it may be worth reviewing the data to make sure the driving behaviour is consistent, since inconsistencies in the data make it very hard to learn. Also, collecting a dataset at the "Fast" speed may be something to try.

It could also be that this task is actually not very easy for the network to learn, since the driving path is less well defined compared to our experiments, where the corridor boundaries are very clear. At the end of the day, this network infers where to drive based only on single images.

chilipeppr commented on May 22, 2024

Hello there. I just tried your latest file, and yes, that is the best one so far. It does a better job than the last one, and the last one was already showing good progress. What were your settings for this one? Just the default file with those changes to process_train_path and process_test_path? Or did you change the epochs and/or batch size? Or other stuff? I'd like to run this myself like you did, but I'll add way more data and more noise.

thias15 commented on May 22, 2024

Same settings.

  • bz: 16
  • lr: 0.0001

But I trained for 200 epochs and still saw improvement in the training/test loss. Using a Surface Book 2 with GPU, one epoch only takes 45 seconds, so the complete training just took a few hours.

Did the fix work for you, i.e. do you now see different, reasonable label values?

chilipeppr commented on May 22, 2024

No, I still see the same labels. So I'm not sure how yours works correctly.

image

chilipeppr commented on May 22, 2024

Here are the changes I made, just like you said. All of the other files, like dataloader.py, are straight from the Github repo with no changes.

image

thias15 commented on May 22, 2024

Did you restart the kernel? It does not automatically reload imports, and this is needed for the changes you made in the dataloader for the Mac/Linux path format. You can also try deleting the old matched*.txt files so they get regenerated.

thias15 commented on May 22, 2024

In order to get the Linux/Mac format for the paths, I did the following:

In cell 8, change
os.path.join(base_dir, "train_data")
to base_dir + "/" + "train_data"
and
os.path.join(base_dir, "test_data")
to base_dir + "/" + "test_data"

In dataloader.py, change line 24
from lines = data.replace(","," ").replace("\t"," ").split("\n")
to lines = data.replace(","," ").replace("\\","/").replace("\r","").replace("\t"," ").split("\n")
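The reason for the cell 8 change: on Windows, os.path.join inserts backslashes, which then fail to match the forward-slash keys. For example:

    import os

    # On Windows the first line yields './dataset\train_data'; on
    # Mac/Linux it yields './dataset/train_data'. Manual concatenation
    # gives the forward-slash form on every platform.
    print(os.path.join("./dataset", "train_data"))
    print("./dataset" + "/" + "train_data")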

chilipeppr commented on May 22, 2024

Yes, I did restart the kernel. I always hit
image

I just tried deleting the matched*.txt files, and it doesn't seem to fix anything. The labeled_ds just keeps repeating the same labels. It still seems that process_train_path doesn't work; or what if the labels are wrong before that step?

chilipeppr commented on May 22, 2024

Finally! Those last code changes finally got the labels to be loaded correctly!

image

chilipeppr commented on May 22, 2024

Ok, now in your earlier specs, is bz the batch size?

image

And then where do I specify lr?

thias15 commented on May 22, 2024

Cell 34: LR = 0.0001

thias15 commented on May 22, 2024

Yes, BZ is the batch size.

chilipeppr commented on May 22, 2024

Ahh. Ok. I figured all tweakable values were in the same section. I see where LR is now.

thias15 commented on May 22, 2024

Good point, I will refactor it. The reason it is there now is that it is directly related to the optimizer.

thias15 commented on May 22, 2024

I have just pushed the changes, and everything should now work on Windows as well. So feel free to pull if you want a clean copy. I have also added a little more info to the notebook and improved the dataset construction; it is probably two orders of magnitude faster now. This will make your life easier when training with larger datasets.

chilipeppr commented on May 22, 2024

Awesome. I checked out the new changes and will run them.

Question: does flipping the phone over mess up the data collection? I've had the USB port on the right side, but found I get a higher-facing trajectory in the images if I flip the USB port onto the left. If anything, that may hurt the consistency of the training, in that the phone gets a slightly different upward tilt and thus the horizon is lower in the image.

I do know that for running the person-detection AI you need the phone flipped the correct way, as initially I had it flipped the wrong way and the robot kept driving away from the person in the frame. Once I accidentally flipped the phone the other way, it started working, and I was surprised to realize the mistake. Worth putting in the docs.

thias15 commented on May 22, 2024

The image will be rotated automatically, so it should not affect data collection. As long as the phone is mounted horizontally, it should not make a difference. If it is mounted vertically, the problem will be the limited horizontal field of view and the image cropping. Person following actually works in both horizontal and vertical orientation. I just tested the "opposite" horizontal orientation (180 degrees) and observed the behaviour you described. This seems to be a bug I did not notice before; it is probably related to the logic that detects the phone orientation and adapts the computation of the motor controls. I will look into it and fix it.

thias15 commented on May 22, 2024

By the way, I have also noticed that the pilot_net network seems to achieve much better performance on your dataset. It is the default in the new notebook. It has a much bigger capacity but still runs in real time. However, due to the larger network, you may run into memory issues during training, depending on the specs of your machine. If it does not work, just change the model back to cil_mobile: in the first cell of the training section, change
model = models.pilot_net(NETWORK_IMG_WIDTH,NETWORK_IMG_HEIGHT,BN)
to
model = models.cil_mobile(NETWORK_IMG_WIDTH,NETWORK_IMG_HEIGHT,BN)

chilipeppr commented on May 22, 2024

Oh, interesting. On my Surface Book 3 I have 32GB of RAM, so hopefully I'm in good shape for running pilot_net. Maybe that should be a configuration up at the top of the notebook too, with some explanation of the difference between the models; I would not have realized this without your comment. I'll try running my data against it right now. I actually just collected a bunch more data with noise turned on and with objects placed in the path to make the data even more interesting.

thias15 commented on May 22, 2024

Cool, let me know how it goes.

chilipeppr commented on May 22, 2024

Hmm. With the changes you made to these lines, you are slurping in data from datasets other than the ones I specify at the top. Not sure you meant to do that. I noticed because the debug output below it was showing other folders, and then I got an error saying my images were different sizes, which I only got because I had trained some older datasets at the "preview" size rather than the "crop" size. I fixed it by just moving my older datasets out of train_data and test_data, but that is a difference.

image

thias15 commented on May 22, 2024

Yes, sorry, the assumption here is that you want to train on all data in the train_data_dir; the individual datasets you set will be ignored. I changed this because it is much faster. I'll see if I can come up with a better solution. In the meantime, just revert to the old way.

chilipeppr commented on May 22, 2024

It is a lot faster, so I'm enjoying the change.

chilipeppr commented on May 22, 2024

I'm running the epochs with your latest code right now; I'm on epoch 5 of 10. Here's my CPU/RAM/GPU usage. It is using a lot of RAM and CPU, but it does not appear to be using my GPU. Any ideas? I did install tensorflow-gpu as my Python library, and I'm on version 2.3.0.

image

Here's the Nvidia GPU stats in Task Manager. Zero usage.

image

chilipeppr commented on May 22, 2024

Ok, I fixed it by moving back to the conda environment, which only has 2.1.0 as the latest version of tensorflow-gpu, instead of using my direct Python install with the pip install of tensorflow-gpu. Here's my GPU usage now under the conda environment.

image
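For anyone hitting the same issue, a quick check that TensorFlow actually sees the GPU (works in TF 2.1+):

    import tensorflow as tf

    print(tf.__version__)
    # An empty list here means TensorFlow will silently fall back to the CPU.
    print(tf.config.list_physical_devices("GPU"))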

chilipeppr commented on May 22, 2024

https://www.youtube.com/watch?v=q0yYN-Ohqwc

Here is the latest video of my latest tflite build. I give it a score of 70%.

I realize you closed this issue, but this is the latest run, with about 100,000 training images and 10,000 test images of just a simple circle around my kitchen. My goal is to get it to 99%, so I figure I'll train with 200,000 images and 20,000 test images in hopes that this gets me to a reasonable spot.

thias15 commented on May 22, 2024

Yes, that makes sense. Feel free to reopen if you feel it is not solved. I closed it because the original issue (getting it to train correctly), which was related to the Windows OS, was solved. You are raising other issues now (e.g. conda version, final task performance, etc.) which are very interesting, and I'm happy to help. However, I would prefer a separate issue with a descriptive title for each; this way it can help others with similar questions later on.

thias15 commented on May 22, 2024

> Hmm. With the changes you made to these lines, you are slurping in data from datasets other than the ones I specify at the top. Not sure you meant to do that. I noticed because the debug output below it was showing other folders, and then I got an error saying my images were different sizes, which I only got because I had trained some older datasets at the "preview" size rather than the "crop" size. I fixed it by just moving my older datasets out of train_data and test_data, but that is a difference.

This is fixed now. The default is to use all datasets, but you can specify individual datasets as well.

Pascal66 commented on May 22, 2024

> Here's the Nvidia GPU stats in Task Manager. Zero usage.

You weren't looking in the right place: it's not the 3D tab but the CUDA tab.
