Visual word sense disambiguation - SemEval 2023

About

V-WSD (Visual Word Sense Disambiguation) is a SemEval 2023 task.
Task description: Given a word and some limited textual context, select from a set of candidate images the one that corresponds to the intended meaning of the target word.
Dataset: The dataset was provided by the task organizers. In the training data, all target words and their contexts are in English. Each entry consists of a target word, its context, and references to 10 images, all separated by tab characters. Only one of the candidate images is the gold standard.

The training set is ~18GB and contains 12869 samples (12999 images are present).
Each entry has the following layout:

target_word <tab> full_phrase <tab> image_1 <tab> image_2 <tab> ... <tab> image_10

"target_word": the potentially ambiguous target word.

"full_phrase": the textual context containing the target_word.
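The entry layout above can be read with a few lines of Python. This is a minimal sketch rather than the project's actual loader: the names `Sample` and `parse_line` are hypothetical, and the example word, phrase, and image file names are made up for illustration.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    target_word: str   # the potentially ambiguous target word
    full_phrase: str   # the textual context containing the target word
    images: List[str]  # references to the 10 candidate images


def parse_line(line: str) -> Sample:
    """Split one tab-separated entry into its fields."""
    fields = line.rstrip("\n").split("\t")
    # 1 target word + 1 phrase + 10 image references = 12 fields
    if len(fields) != 12:
        raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
    return Sample(fields[0], fields[1], fields[2:])


# Illustrative entry; the values are placeholders, not real dataset content.
example = "andromeda\tandromeda tree\t" + "\t".join(
    f"image_{i}.jpg" for i in range(1, 11)
)
sample = parse_line(example)
```

Raising on a wrong field count is a design choice made here so that malformed lines are caught early instead of silently producing samples with missing candidates.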

Infrastructure

We ran our experiments on GCP's Compute Engine with a single V100 GPU (16GB VRAM). We cached the dataset on GCS to avoid repeated downloads.

Running the code

The dependencies for running the code can be installed using the env.yaml file.

conda env create -f env.yaml
conda activate cs521

To train the model, replace the placeholders in angle brackets in the following command:

python main.py --base_path "<DATASET_DIR>" --model_save_path "<SAVE_DIR>" --model_log_path "<LOG_DIR>"

To evaluate the model, replace the placeholders in angle brackets in the following command:

python main.py --execute 1 --base_path "<DATASET_DIR>" --model_save_path "<SAVE_DIR>" --model_log_path "<LOG_DIR>"

Results

Training Results

[Figures: Loss; Mean Reciprocal Rank; Hit rate @ 1]
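For readers unfamiliar with the two ranking metrics named above, the sketch below shows one common way to compute mean reciprocal rank and hit rate at 1 from per-candidate scores. It is not taken from this repository: the function names are hypothetical, higher scores are assumed to mean better matches, and ties are resolved in favor of the gold image, which is an assumption.

```python
from typing import List, Sequence, Tuple


def rank_of_gold(scores: Sequence[float], gold_index: int) -> int:
    """Return the 1-based rank of the gold candidate (higher score = better)."""
    gold_score = scores[gold_index]
    # Count candidates that score strictly higher than the gold one.
    # Ties are resolved in favor of the gold candidate (an assumption).
    return 1 + sum(1 for s in scores if s > gold_score)


def mrr_and_hit_at_1(
    all_scores: List[Sequence[float]], gold_indices: List[int]
) -> Tuple[float, float]:
    """Average the reciprocal rank and the rank-1 indicator over all samples."""
    ranks = [rank_of_gold(s, g) for s, g in zip(all_scores, gold_indices)]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hit_at_1 = sum(1 for r in ranks if r == 1) / len(ranks)
    return mrr, hit_at_1


# Two toy samples with three candidates each (the real task has ten).
toy_scores = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]]
toy_gold = [0, 2]  # gold is ranked 1st in the first sample, 2nd in the second
mrr, hit_at_1 = mrr_and_hit_at_1(toy_scores, toy_gold)
```

In the toy example the gold ranks are 1 and 2, so the mean reciprocal rank is (1 + 1/2) / 2 = 0.75 and the hit rate at 1 is 0.5.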

Future work

Because we had limited access to better hardware, we were able to run only one experiment. In the future, we could:

Perform hyperparameter search.
Investigate data augmentation. Further negative examples could be added to each sample based on the same target word.
Apply image augmentations such as color jitter and random crop.
Use more powerful vision encoders. We used the tiny variant of ConvNeXt V2.
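As one possible reading of the negative-example idea in the list above, the sketch below extends each sample's candidate list with the gold images of other samples that share the same target word. This is an illustration, not something the project implements: the tuple layout and the function name `add_same_word_negatives` are hypothetical, the gold image of each sample is assumed to be known, and the approach assumes that different phrases for the same word refer to different senses, which may not always hold.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: (target_word, full_phrase, candidate_images, gold_image)
Sample = Tuple[str, str, List[str], str]


def add_same_word_negatives(samples: List[Sample]) -> List[Sample]:
    """Append gold images of other samples with the same target word as negatives."""
    gold_by_word: Dict[str, List[str]] = defaultdict(list)
    for word, _, _, gold in samples:
        gold_by_word[word].append(gold)

    augmented: List[Sample] = []
    for word, phrase, candidates, gold in samples:
        # Skip this sample's own gold image and anything already a candidate.
        extra = [g for g in gold_by_word[word] if g != gold and g not in candidates]
        extra = list(dict.fromkeys(extra))  # drop duplicates, keep order
        augmented.append((word, phrase, candidates + extra, gold))
    return augmented


# Toy data with made-up file names.
toy = [
    ("bank", "river bank", ["a.jpg", "b.jpg"], "a.jpg"),
    ("bank", "bank account", ["c.jpg", "b.jpg"], "c.jpg"),
    ("bat", "baseball bat", ["d.jpg"], "d.jpg"),
]
augmented = add_same_word_negatives(toy)
```

Samples whose target word appears only once are returned unchanged, so the augmentation only affects words that occur with several contexts.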

References

LiT: Zero-Shot Transfer with Locked-image text Tuning, Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer
