wuweilun / ntu-dlcv-2023

NTU Deep Learning for Computer Vision 2023 course

NTU-DLCV-2023

This repository contains the assignments and final project for the NTU Deep Learning for Computer Vision 2023 course. There are a total of four assignments and one final project. Detailed information is provided below.

Assignment 1 Report

Task 1: Image Classification

  • Validation Accuracy:

    • Model A (ResNet-18): 74.96%
    • Model B (EfficientNetV2-S): 90.20%
  • PCA Visualization: Limited class separation; only a few classes are clearly distinguishable.

  • t-SNE Visualization: Better class separation than PCA, with more distinct clusters.

Task 2: Self-Supervised Learning (SSL) BYOL

| Setting | Pre-training | Fine-tuning | Validation Accuracy |
|---|---|---|---|
| A | - | Train full model | 44.58% |
| B | w/ label (TA backbone) | Train full model | 49.01% |
| C | w/o label (SSL backbone) | Train full model | 50.74% |
| D | w/ label (TA backbone) | Fix backbone | 28.33% |
| E | w/o label (SSL backbone) | Fix backbone | 30.54% |
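
For readers unfamiliar with BYOL, the two ingredients of its pre-training step can be sketched in plain Python. This is a minimal illustration of the published BYOL objective, not the code used in this assignment; the function names and the `tau` value are assumptions chosen for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def byol_loss(online_prediction, target_projection):
    """BYOL regression loss: 2 - 2 * cos(p, z).

    It is 0 when the two vectors point the same way and 4 when they
    point in opposite directions.
    """
    return 2.0 - 2.0 * cosine_similarity(online_prediction, target_projection)


def ema_update(target_params, online_params, tau=0.99):
    """Move target parameters toward the online ones by an exponential
    moving average. tau=0.99 is an illustrative value only."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```

No labels appear anywhere in the loss, which is why the SSL backbone in settings C and E can be pre-trained without them.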

Task 3: Semantic Segmentation

  • mIoU (Mean Intersection over Union):
    • VGG16-FCN32s: 0.7253
    • DeepLabV3-ResNet50: 0.7597
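
As a reference for how the metric above is defined, here is a small sketch of mIoU over flattened label maps. It assumes that classes absent from both prediction and ground truth are skipped; the course's official evaluation script may treat such classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for two flat lists of class labels.

    Classes that appear in neither list are skipped (an assumption of
    this sketch).
    """
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # class c is absent from both prediction and target
        ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)` averages an IoU of 1/2 for class 0 and 2/3 for class 1.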

Score: 100/100

Assignment 2 Report

Task 1: DDPM

  • Inference Time:

    • Initial (with CFG): 20 minutes for 1000 images
    • After disabling CFG: 10 minutes for 1000 images
  • Accuracy:

    • With CFG: 99.5%
    • Without CFG: 96%
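
The time/accuracy trade-off above is consistent with how classifier-free guidance (CFG) is usually implemented: each sampling step evaluates the network twice, once with the class condition and once without, and then mixes the two noise estimates. The sketch below shows that mixing rule on plain lists. The function names and the use of a guidance weight `w` are assumptions for illustration, and the repository's sampler may batch the two passes differently.

```python
def cfg_noise_estimate(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional noise predictions:

        eps = (1 + w) * eps_cond - w * eps_uncond

    With w = 0 this reduces to the conditional prediction alone.
    """
    w = guidance_scale
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


def forward_passes_per_step(use_cfg):
    """CFG needs a conditional and an unconditional pass per step."""
    return 2 if use_cfg else 1
```

Dropping the unconditional pass halves the per-step network cost, which matches the reported drop from 20 to 10 minutes.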

Task 2: DDIM

  • Eta Values & Image Diversity:

    • Eta = 0: Sampling is deterministic, so the denoised images are identical to the originals
    • Eta ≥ 0.5: Increased diversity in the denoised images
  • Interpolation Observations:

    • Spherical Linear Interpolation (Slerp): Smooth transitions between facial features
    • Simple Linear Interpolation: Blurred intermediate images with less detail
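
Both observations can be illustrated with a few lines of Python. The sketch below assumes the standard DDIM noise scale from the DDIM paper and a textbook slerp, and all function names are illustrative rather than taken from the repository. With `eta = 0` the per-step noise scale is zero, so sampling adds no randomness. Slerp keeps the norm of the interpolated latent, whereas linear interpolation shrinks it, which is one plausible reason for the blurrier intermediate images.

```python
import math


def ddim_sigma(eta, alpha_bar_t, alpha_bar_prev):
    """Per-step noise scale in DDIM sampling (cumulative alphas as inputs)."""
    return (
        eta
        * math.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
        * math.sqrt(1.0 - alpha_bar_t / alpha_bar_prev)
    )


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def lerp(x0, x1, t):
    """Simple linear interpolation between two latent vectors."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def slerp(x0, x1, t):
    """Spherical linear interpolation between two latent vectors."""
    cos_theta = sum(a * b for a, b in zip(x0, x1)) / (norm(x0) * norm(x1))
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    if theta < 1e-6:
        return lerp(x0, x1, t)  # nearly parallel vectors: fall back to lerp
    w0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]
```

For two orthogonal unit vectors, the slerp midpoint still has norm 1, while the lerp midpoint has norm of about 0.71.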

Task 3: Domain-Adversarial Neural Network (DANN)

| Setting | MNIST-M → SVHN | MNIST-M → USPS |
|---|---|---|
| Trained on Source | 40.03% (6359/15887) | 80.44% (1197/1488) |
| Adaptation (DANN) | 52.51% (8343/15887) | 93.28% (1388/1488) |
| Trained on Target | 93.64% (14877/15887) | 98.86% (1471/1488) |
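
The core trick in DANN is a gradient reversal layer placed between the feature extractor and the domain classifier. The scalar sketch below shows the idea together with the ramp-up schedule for the reversal strength proposed in the DANN paper. Whether this assignment used that exact schedule is an assumption, and the class and function names are illustrative.

```python
import math


def dann_lambda(progress, gamma=10.0):
    """Reversal strength as a function of training progress in [0, 1]:

        lambda = 2 / (1 + exp(-gamma * progress)) - 1

    It starts at 0 and approaches 1 toward the end of training.
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


class GradientReversal:
    """Scalar illustration of a gradient reversal layer.

    The forward pass is the identity; the backward pass multiplies the
    incoming gradient by -lambda, so the feature extractor is pushed to
    confuse the domain classifier.
    """

    def __init__(self, lambd):
        self.lambd = lambd

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lambd * grad_output
```

In a real framework the same behaviour is typically written as a custom autograd function rather than a hand-written `backward` method.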

Score: 100/105

Assignment 3 Report

Task 1: Zero-shot image classification with CLIP

  • Validation Accuracy:
    • “This is a photo of {object}”: 67.48%
    • “This is not a photo of {object}”: 69.64%
    • “No {object}, no score.”: 45.24%
    • CIFAR100 prompt templates: 82.48%
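
The prompt comparison above follows the usual CLIP zero-shot recipe: fill a template with each class name, embed the prompts, and pick the class whose text embedding is most similar to the image embedding. The sketch below uses toy vectors in place of real CLIP embeddings. The helper names are hypothetical, and averaging over several templates is an assumption about how a multi-template setting could be run.

```python
import math


def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def build_prompts(template, class_names):
    """Fill a template such as 'This is a photo of {object}' per class."""
    return [template.format(object=name) for name in class_names]


def ensemble_embeddings(embeddings):
    """Average several normalized text embeddings for one class,
    then renormalize (prompt ensembling)."""
    unit = [normalize(e) for e in embeddings]
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    return normalize(mean)


def zero_shot_predict(image_embedding, text_embeddings):
    """Return the index of the class with the highest cosine similarity."""
    img = normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(img, normalize(t))) for t in text_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)
```

In practice the embeddings would come from the CLIP image and text encoders; only the template string changes between the first three settings listed above.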

Task 2: PEFT on Vision and Language Model for Image Captioning

| Method | CIDEr | CLIPScore |
|---|---|---|
| Adapter | 0.964 | 0.733 |
| LoRA | 0.901 | 0.726 |
| Prefix Tuning | 0.827 | 0.714 |
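
Of the three methods, LoRA is the easiest to show in a few lines: the frozen weight `W` is left untouched and a trainable low-rank product `B A` is added on top. The sketch below uses plain nested lists and illustrative names. It follows the standard LoRA formulation and is not the configuration used in this assignment.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """LoRA forward pass: y = W x + (alpha / rank) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (rank, d_in)
    B: trainable up-projection, shape (d_out, rank), usually initialized
       to zeros so training starts from the pretrained behaviour
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

Only `A` and `B` are trained, so the number of trainable parameters grows with `rank * (d_in + d_out)` instead of `d_in * d_out`.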

Task 3: Visualization of attention in image captioning

  • Example Images:
    • Correctly identified objects in two out of three example images.
    • Uncertainty in distinguishing between a tree and a bicycle in the third image.

Score: 99/100

Assignment 4 Report

Task 1: 3D Novel View Synthesis with NeRF

| Settings | PSNR | SSIM | LPIPS (vgg) |
|---|---|---|---|
| layers: 8, skips: 4, embedding: 256 | 43.40 | 0.9941 | 0.0986 |
| layers: 8, skips: 4, embedding: 512 | 43.73 | 0.9945 | 0.0991 |
| layers: 6, skips: 3, embedding: 256 | 43.82 | 0.9943 | 0.1004 |
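
To put the PSNR column in context, the metric is a log-scaled mean squared error. The sketch below assumes pixel values in [0, 1] (`max_value=1.0`); the function names are illustrative and the course's evaluation script may differ in details such as per-image averaging.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR formula to recover the mean squared error."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))
```

Under that assumption, a PSNR of 43.40 dB corresponds to an MSE of roughly 4.6e-5, so the three settings differ by very small absolute errors.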

Score: 100/100

Final Project: Situated Reasoning (Video Question Answering)

Overview

This project builds on the Flipped-VQA architecture to improve video-text representation and video question answering. It replaces two key components of the model: the visual encoder and the underlying language model. As in Flipped-VQA, the model learns to predict answers (A), questions (Q), and video frames (V) from the pairs VQ, VA, and QA, respectively; the replacements are intended to improve accuracy on these predictions.

Key Contributions

  • Visual Encoder Replacement: The visual encoder is replaced with ViCLIP, a video CLIP model designed for transferable video-text representation.

  • Language Model Upgrade: The language model is upgraded from LLaMA-1 7B to LLaMA-2 7B, a more powerful large language model (LLM).

Results

  • Int_Acc (Interaction Accuracy): 65.12
  • Seq_Acc (Sequence Accuracy): 68.38
  • Pre_Acc (Prediction Accuracy): 58.10
  • Fea_Acc (Feasibility Accuracy): 50.78
  • Mean: 60.60
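
The reported mean is consistent with an unweighted average of the four question-type accuracies, as this short check shows. Treating the mean as unweighted over question types is an assumption that happens to match the numbers; the benchmark's official scoring is not described here.

```python
# Accuracies per question type, as listed above.
scores = {
    "Int_Acc": 65.12,
    "Seq_Acc": 68.38,
    "Pre_Acc": 58.10,
    "Fea_Acc": 50.78,
}

# Unweighted mean over the four question types.
mean_acc = sum(scores.values()) / len(scores)
print(f"{mean_acc:.3f}")
```

The unweighted mean is 60.595, which the list above reports as 60.60.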

Contributors

wuweilun, jiangjiangzh, github-classroom[bot], howard891008, peterscarf
