
The CORSMAL Challenge 2020 Submission

We also composed a Tech Report 📖. Feel free to check it out to learn more about the implementation details.

Also, here is the video presentation from the conference on YouTube 🎥.

Overview 🔎

Figure: an individual model is designed for each sub-task: container filling level classification (left), filling type classification (middle), and container capacity estimation (right).

CORSMAL Challenge 2020 Submission (Because It's Tactile)

For the filling level estimation we rely on both modalities: RGB (R(2+1)D features) and audio ("classical" and VGGish features). In particular, we encode the stream from each camera, placed at a different viewpoint, with an individual GRU model, sum up the logits, and apply a softmax at the output. Next, the VGGish features extracted from the audio stream are processed with another GRU model, which outputs class probabilities. The "classical" audio features are classified with a random forest. Finally, the probabilities from all three models are averaged to produce the final prediction for the filling level, as sketched below.
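
To make the late fusion above concrete, here is a minimal sketch of the probability-averaging step. It is illustrative only: the function name and tensor names are ours, not the repo's API, and we assume the per-camera GRUs output raw logits while the two audio models output class probabilities.

import torch

def fuse_filling_level(rgb_logits_per_cam, vggish_probs, rf_probs):
    """Late fusion of the three filling-level models (illustrative sketch).

    rgb_logits_per_cam: list of (num_classes,) logit tensors, one per camera GRU
    vggish_probs:       (num_classes,) class probabilities from the VGGish GRU
    rf_probs:           (num_classes,) class probabilities from the random forest
    """
    # sum the per-camera logits and apply softmax to get the RGB-stream probabilities
    rgb_probs = torch.softmax(torch.stack(rgb_logits_per_cam).sum(dim=0), dim=0)
    # average the probabilities of the three models to form the final prediction
    final_probs = (rgb_probs + vggish_probs + rf_probs) / 3
    return final_probs.argmax().item(), final_probs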

The procedure for filling type classification resembles the one for the filling level, except that the RGB stream is not used.

The capacity estimation pipeline starts with the extraction of two frames from the two camera views (left and right): the 1st frame and the 20th frame from the end. Both frames are passed through Mask R-CNN to detect the container. The detected bounding boxes are used to crop both frames, and the crops are sent to the LoDE model, which outputs an estimate of the container capacity. If the object is not detected in either of the frames, we fall back to the training prior (see the sketch below).
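
Below is a hedged sketch of that pipeline. The helper names (detect_container, lode_estimate) and the way the two candidate frames are combined are our placeholders and assumptions, not the actual implementation in /capacity.

def crop(frame, box):
    # frame: HxWxC image array; box: (x1, y1, x2, y2) pixel coordinates from the detector
    x1, y1, x2, y2 = box
    return frame[y1:y2, x1:x2]

def estimate_capacity(frames_c1, frames_c2, detect_container, lode_estimate, training_prior):
    """Illustrative capacity pipeline: detect_container and lode_estimate are placeholders
    standing in for the Mask R-CNN detector and the LoDE model."""
    # candidate frames: the 1st frame and the 20th frame from the end, from both views
    for idx in (0, -20):
        f1, f2 = frames_c1[idx], frames_c2[idx]
        box1, box2 = detect_container(f1), detect_container(f2)
        if box1 is None or box2 is None:
            continue  # the container was not detected in one of the views
        return lode_estimate(crop(f1, box1), crop(f2, box2))  # capacity estimate from LoDE
    return training_prior  # fallback: the prior computed on the training set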

Hungry for more details? Check out our Tech Report 📖.

Team 👋

Vladimir Iashin, Francesca Palermo, Gökhan Solak, and Claudio Coppola.

Task ⚔️

The CORSMAL challenge focuses on the estimation of the weight of containers which depends on the presence of a filling and its amount and type, in addition to the container capacity. Participants should determine the physical properties of a container while it is manipulated by a human, when both containers and fillings are not known a priori.

Technically, the main task is to estimate the overall filling mass. This quantity can be obtained by solving three sub-tasks (see the sketch after the list):

  • Container capacity estimation (any positive number)
  • Filling type classification (boxes: pasta, rice; glasses/cups: water, pasta, rice; or nothing (empty))
  • Filling level classification (0, 50, 90%)
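
A hedged sketch of how the three sub-task outputs combine into the filling mass is given below; the density values are illustrative placeholders rather than the official challenge constants.

# Illustrative filling densities in g/mL (placeholders, not the official constants)
DENSITY = {'water': 1.00, 'rice': 0.85, 'pasta': 0.41}

def filling_mass(capacity_ml, filling_type, filling_level_pct):
    """Combine the three sub-task estimates into the overall filling mass (grams)."""
    if filling_type == 'empty' or filling_level_pct == 0:
        return 0.0
    return capacity_ml * (filling_level_pct / 100.0) * DENSITY[filling_type]

# e.g. a 500 mL cup filled to 50% with rice -> 500 * 0.5 * 0.85 = 212.5 g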

Dataset 🥤📘🥛

  • Download the CORSMAL dataset here
  • The dataset consists of 15 containers: 5 drinking cups, 5 glasses, and 5 food boxes. These containers are made of different materials, such as plastic, glass, and paper.
  • A container can be filled with water (only glasses and cups), rice, or pasta at 3 different levels: 0, 50, or 90% of the container capacity.
  • Every container-filling combination is executed by a different subject (12 subjects) for each background (2) and illumination condition (2). The total number of configurations is 1140.
  • Each event in the dataset is acquired with several sensors, making the CORSMAL dataset multi-modal:
    • 4 cameras (1280x720 @ 30 Hz), each providing:
      • RGB,
      • narrow-baseline stereo infrared,
      • depth images,
      • inertial measurements (IMU),
      • calibration;
    • 8-channel 44.1 kHz audio.

Code organization 📁

Each folder in this repo corresponds to a dedicated task. The filling level and filling type tasks each have several approaches and, hence, each approach has an individual folder.

  • /capacity
  • /filling_level
    • /CORSMAL-pyAudioAnalysis
    • /r21d_rgb
    • /vggish
    • README.md
  • /filling_type
    • /CORSMAL-pyAudioAnalysis
    • /r21d_rgb
    • /vggish
    • README.md

How to run evaluation ❔

We recommend using our Docker image to run the evaluation script. (To experiment with different approaches, inspect the individual README files inside the task folders.) To run it without Docker, install the environment using the commands from the Dockerfile (ignore the first 11 lines) and execute the run_eval.sh script.

Running the script takes <4 hours on a 10-core i9-7900X X-series machine with 4x1080Ti GPUs (one GPU is enough), 64 GB of RAM (at least 20 GB; the more GPUs you use, the more RAM will be allocated), and 40 GB of extra disk space (besides the dataset).

Install Docker (19.03.13) and run our script:

# pull our image from the docker hub
docker pull iashin/corsmal:latest

# source: the path to the dir with corsmal on the host; destination: the path where the source folder will be mounted
# if you would like to attach a shell (and debug), just append `/bin/bash` to the command below,
# because, by default, it will run the evaluation script
# also, the script will work even with just one GPU but, if you would like to speed up the 1st part, try more
docker run \
    --mount type=bind,source=/path/to/corsmal/,destination=/home/ubuntu/CORSMAL/dataset/ \
    -it --gpus '"device=0,1,2,3"' \
    iashin/corsmal:latest

# copy submission files from the container once it finishes running the script
docker cp container_id:/home/ubuntu/CORSMAL/submission_public_test.csv .
docker cp container_id:/home/ubuntu/CORSMAL/submission_private_test.csv .

The expected structure of the DATA_ROOT folder:

DATA_ROOT
├── [1-9]
│   ├── audios
│   │   └── sS_fiI_fuU_bB_lL_audio.wav
│   ├── calib
│   │   └── sS_fiI_fuU_bB_lL_cC_calib.pickle
│   ├── depth
│   │   └── sS_fiI_fuU_bB_lL
│   ├── ir
│   │   └── sS_fiI_fuU_bB_lL_cC_irR.mp4
│   └── rgb
│       └── sS_fiI_fuU_bB_lL_cC.mp4
└── [10-15]
    ├── audio
    │   └── XXXX
    ├── calib
    │   └── XXXX_cC_calib.pickle
    ├── depth
    │   └── XXXX
    ├── ir
    │   └── XXXX_cC_irR.mp4
    └── rgb
        └── XXXX_cC.mp4
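
If you need to iterate over the recordings programmatically, a small filename parser like the sketch below may help. The field meanings (s = subject, fi = filling type, fu = filling level, b = background, l = lighting, c = camera) are our reading of the pattern above, not an official specification.

import re

# matches e.g. 's3_fi1_fu2_b0_l1_c2.mp4' or 's3_fi1_fu2_b0_l1_audio.wav'
PATTERN = re.compile(r's(?P<s>\d+)_fi(?P<fi>\d+)_fu(?P<fu>\d+)_b(?P<b>\d+)_l(?P<l>\d+)(?:_c(?P<c>\d+))?')

def parse_name(filename):
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError(f'unrecognised CORSMAL filename: {filename}')
    # keep only the fields that are present (e.g. audio files have no camera id)
    return {key: int(value) for key, value in match.groupdict().items() if value is not None}

print(parse_name('s3_fi1_fu2_b0_l1_c2.mp4'))  # {'s': 3, 'fi': 1, 'fu': 2, 'b': 0, 'l': 1, 'c': 2}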

Please note that we took great care to make our results reproducible by fixing the seeds and sharing the pre-trained models and package versions. However, training on different hardware might give you slightly different results. We observed that the change is 🤏 <0.01.

How to train a new model? 🚀

Please follow the instructions provided in filling_level and filling_type. Note that we do not use any training for the capacity sub-task, but you can still adapt the code to apply it to your dataset.

License 👩‍⚖️

We distribute our code under the MIT license. However, our code relies on libraries that have different licenses:

  • LoDE (container capacity): Creative Commons Attribution-NonCommercial 4.0
  • pyAudioAnalysis (filling level and type): Apache-2.0 License
  • video_features (filling level and type): check here

BibTeX 🙂

@misc{CORSMAL_Iashin_2020,
      title={Top-1 CORSMAL Challenge 2020 Submission: Filling Mass Estimation Using Multi-modal Observations of Human-robot Handovers},
      author={Vladimir Iashin and Francesca Palermo and G\"okhan Solak and Claudio Coppola},
      year={2020},
      eprint={2012.01311},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
