
Davidsonian Scene Graph (DSG) Code

Note

For more info about DSG, please visit our project page.

This repository contains the code for DSG, a new framework for fine-grained text-to-image evaluation using Davidsonian semantics, as described in the paper:

Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation (ICLR 2024)

by Jaemin Cho, Yushi Hu, Jason Baldridge, Roopal Garg, Peter Anderson, Ranjay Krishna, Mohit Bansal, Jordi Pont-Tuset, Su Wang

Updates


Background - QG/A frameworks for text-to-image evaluation

Evaluating text-to-image models is notoriously difficult. A strong recent approach to assessing text-image faithfulness is QG/A (question generation and answering): pretrained foundation models automatically generate a set of questions and answers from the prompt, and output images are scored by checking whether the answers extracted with a visual question answering (VQA) model are consistent with the prompt-based answers. This kind of evaluation naturally depends on the quality of the underlying QG and QA models. We identify and address several reliability challenges in existing QG/A work: (a) generated questions should respect the prompt (avoiding hallucinations, duplications, and omissions), and (b) VQA answers should be consistent (e.g., not asserting that there is no motorcycle in an image while also claiming the motorcycle is blue).

Our solution - Davidsonian Scene Graph (DSG)

We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics. DSG is an automatic, graph-based QG/A framework that is modularly implemented to be adaptable to any QG/A modules. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-source evaluation benchmark with 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We will release the DSG-1k prompts and the corresponding DSG questions.


Example usage of DSG

Below is the pseudocode for evaluating text-to-image generation models with DSG.

Please check ./t2i_eval_example.ipynb for a full example using gpt-3.5-turbo-16k as the LLM and mPLUG-large as the VQA model. You can replace these with any other LLM and VQA models.

PROMPT_TUPLE = """Task: given input prompts,
describe each scene with skill-specific tuples ...
"""

PROMPT_DEPENDENCY = """Task: given input prompts and tuples,
describe the parent tuples of each tuple ...
"""

PROMPT_QUESTION = """Task: given input prompts and skill-specific tuples,
rewrite each tuple as a natural language question ...
"""

def generate_dsg(text, LLM):
    """generate DSG (tuples, dependency, and questions) from text"""
    # 1) generate atomic semantic tuples
    # output: dictionary of {tuple id: semantic tuple}
    id2tuples = LLM(text, PROMPT_TUPLE)
    # 2) generate dependency graph from the tuples
    # output: dictionary of {tuple id: ids of parent tuples}
    id2dependency = LLM(text, id2tuples, PROMPT_DEPENDENCY)
    # 3) generate questions from the tuples
    # output: dictionary of {tuple id: generated question}
    id2questions = LLM(text, id2tuples, PROMPT_QUESTION)
    return id2tuples, id2dependency, id2questions

def evaluate_image_dsg(text, generated_image, VQA, LLM):
    """evaluate a generated image with DSG"""
    # 1) generate DSG from text
    id2tuples, id2dependency, id2questions = generate_dsg(text, LLM)
    # 2) answer questions with the generated image
    id2scores = {}
    for id, question in id2questions.items():
        answer = VQA(generated_image, question)
        id2scores[id] = float(answer == 'yes')
    # 3) zero out scores of questions whose parent questions were answered 'no'
    # (a single pass propagates failures, assuming parent ids precede child ids)
    for id, parent_ids in id2dependency.items():
        if any(id2scores[parent_id] == 0 for parent_id in parent_ids):
            id2scores[id] = 0
    # 4) calculate the final score by averaging
    average_score = sum(id2scores.values()) / len(id2scores)
    return average_score
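
To make the dependency logic concrete, here is a minimal, self-contained toy run of the scoring above. The prompt, questions, and VQA answers are hardcoded, hypothetical stand-ins for real model calls:

id2questions = {
    1: 'Is there a motorcycle?',
    2: 'Is the motorcycle blue?',
    3: 'Are there doors?',
    4: 'Is the paint on the doors chipped?',
}
# each question id maps to the ids of its parent questions
id2dependency = {1: [], 2: [1], 3: [], 4: [3]}

# pretend the VQA model returned these answers for an image with no
# motorcycle (note the inconsistent 'yes' for question 2)
vqa_answers = {1: 'no', 2: 'yes', 3: 'yes', 4: 'yes'}

id2scores = {id: float(vqa_answers[id] == 'yes') for id in id2questions}

# zero out children of failed parents, as in evaluate_image_dsg above
for id, parent_ids in id2dependency.items():
    if any(id2scores[parent_id] == 0 for parent_id in parent_ids):
        id2scores[id] = 0

print(id2scores)  # {1: 0.0, 2: 0, 3: 1.0, 4: 1.0}
print(sum(id2scores.values()) / len(id2scores))  # 0.5

Without the dependency pass, the inconsistent 'yes' for question 2 would have credited the image for a blue motorcycle that the VQA model itself said is not there.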

Code Structure

# Generate DSG with arbitrary LLM
query_utils.py

# Parse generation results from LLM
parse_utils.py

# Example LLM call - GPT-3.5 turbo 16k
openai_utils.py

# Example VQA call - mPLUG-large / InstructBLIP
vqa_utils.py

# Example TIFA dev json
tifa160-dev-anns.json

Setup

conda create -n dsg python=3.9
conda activate dsg
pip install -r requirements.txt

# (optional) to use GPT models via the OpenAI API as the DSG-generating LLM
pip install openai

# (optional) to use mPLUG-large / InstructBLIP as the VQA model
pip install transformers==4.31.0 torch==2.0.1 "modelscope[multi-modal]" salesforce-lavis
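
As a reference for plugging a VQA model into the pseudocode above, below is a minimal sketch of a yes/no VQA callable built on the ModelScope pipeline for mPLUG-large. The model id and the {'text': ...} output format are assumptions based on the ModelScope hub; the repository's actual wrapper lives in vqa_utils.py and may differ.

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# model id assumed from the ModelScope hub
vqa_pipeline = pipeline(
    Tasks.visual_question_answering,
    model='damo/mplug_visual-question-answering_coco_large_en',
)

def VQA(image_path, question):
    """Return the text answer (e.g., 'yes' / 'no') for a single question."""
    result = vqa_pipeline({'image': image_path, 'question': question})
    # output format assumed to be {'text': answer}
    return result['text'].strip().lower()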

DSG-1k score calculation (e.g., with SDv2.1 images)

python t2i_eval_dsg1k.py

The output should look like the following:

Evaluating T2I models on DSG1k
t2i_model: sd2dot1

Dataset category: tifa | # items: 160
Avg. score: 88.1%
Dataset category: paragraph | # items: 200
Avg. score: 85.1%
Dataset category: relation | # items: 100
Avg. score: 34.4%
Dataset category: count | # items: 100
Avg. score: 70.4%
Dataset category: real_user | # items: 200
Avg. score: 89.7%
Dataset category: pose | # items: 100
Avg. score: 89.6%
Dataset category: defying | # items: 100
Avg. score: 84.4%
Dataset category: text | # items: 100
Avg. score: 83.7%
Dataset category: all | # items: 1060
Avg. score: 80.5%
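
The per-category numbers above are averages of per-prompt DSG scores within each dataset category, scaled to percentages. A minimal sketch of that aggregation, with hypothetical records:

from collections import defaultdict

# hypothetical (category, per-prompt DSG score) records
records = [('tifa', 0.9), ('tifa', 0.85), ('count', 0.7), ('count', 0.72)]

cat2scores = defaultdict(list)
for category, score in records:
    cat2scores[category].append(score)
cat2scores['all'] = [score for _, score in records]

for category, scores in cat2scores.items():
    print(f'Dataset category: {category} | # items: {len(scores)}')
    print(f'Avg. score: {100 * sum(scores) / len(scores):.1f}%')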

Human correlation on TIFA-160

python tifa160_correlation.py

The output should look like the following:

Calculate Correlation
vqa_model: mplug-large
 qg_model: tifa
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: tifa | VQA: mplug-large
VQA vs. Likert correlation
Spearman: 0.352 (p-value: 1.1094815797843165e-24)
Kendall: 0.259 (p-value: 3.654716643841926e-24)
n_samples: 800
==============================
 qg_model: vq2a
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: vq2a | VQA: mplug-large
VQA vs. Likert correlation
Spearman: 0.212 (p-value: 1.3184891767851401e-09)
Kendall: 0.164 (p-value: 1.193986513000161e-09)
n_samples: 800
==============================
 qg_model: dsg
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: dsg | VQA: mplug-large
VQA vs. Likert correlation
Spearman: 0.463 (p-value: 9.454980885221468e-44)
Kendall: 0.380 (p-value: 1.7383231097374204e-41)
n_samples: 800
==============================
 qg_model: dsg without dependency
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: dsg without dependency | VQA: mplug-large
VQA vs. Likert correlation
Spearman: 0.464 (p-value: 7.181821197264219e-44)
Kendall: 0.378 (p-value: 6.481086461681368e-41)
n_samples: 800
==============================
vqa_model: instruct-blip
 qg_model: tifa
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: tifa | VQA: instruct-blip
VQA vs. Likert correlation
Spearman: 0.460 (p-value: 4.686040918391364e-43)
Kendall: 0.360 (p-value: 8.780630004246874e-41)
n_samples: 800
==============================
 qg_model: vq2a
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: vq2a | VQA: instruct-blip
VQA vs. Likert correlation
Spearman: 0.426 (p-value: 1.5483862335996184e-36)
Kendall: 0.352 (p-value: 7.79505115720718e-34)
n_samples: 800
==============================
 qg_model: dsg
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: dsg | VQA: instruct-blip
VQA vs. Likert correlation
Spearman: 0.442 (p-value: 1.119379944290985e-39)
Kendall: 0.364 (p-value: 3.1980138119050176e-37)
n_samples: 800
==============================
 qg_model: dsg without dependency
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: dsg without dependency | VQA: instruct-blip
VQA vs. Likert correlation
Spearman: 0.436 (p-value: 1.757469672881809e-38)
Kendall: 0.355 (p-value: 2.0226556044720833e-35)
n_samples: 800
==============================
vqa_model: pali-17b
 qg_model: tifa
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: tifa | VQA: pali-17b
VQA vs. Likert correlation
Spearman: 0.431 (p-value: 1.443006704479334e-37)
Kendall: 0.323 (p-value: 4.304350427416642e-37)
n_samples: 800
==============================
 qg_model: vq2a
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: vq2a | VQA: pali-17b
VQA vs. Likert correlation
Spearman: 0.207 (p-value: 3.321822158171524e-09)
Kendall: 0.157 (p-value: 3.958251807649611e-09)
n_samples: 800
==============================
 qg_model: dsg
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: dsg | VQA: pali-17b
VQA vs. Likert correlation
Spearman: 0.571 (p-value: 2.500599870516443e-70)
Kendall: 0.458 (p-value: 1.4702926454843743e-63)
n_samples: 800
==============================
 qg_model: dsg without dependency
  t2i_model: mini-dalle
  t2i_model: sd1dot1
  t2i_model: sd1dot5
  t2i_model: sd2dot1
  t2i_model: vq-diffusion
N items: 800
N items (after filtering nan): 800
==============================
Summary
QG: dsg without dependency | VQA: pali-17b
VQA vs. Likert correlation
Spearman: 0.570 (p-value: 3.4035331067710746e-70)
Kendall: 0.457 (p-value: 4.356245185978116e-63)
n_samples: 800
==============================
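
The Spearman and Kendall statistics in the summaries above are rank correlations between per-item VQA-based scores and human Likert ratings. They can be computed from two aligned score lists with scipy, as in this minimal sketch (the data here are hypothetical placeholders):

from scipy import stats

vqa_scores = [0.75, 1.0, 0.4, 0.9, 0.5]  # per-item DSG/VQA scores (placeholder)
likert_ratings = [4, 5, 2, 5, 3]         # human Likert ratings (placeholder)

rho, rho_p = stats.spearmanr(vqa_scores, likert_ratings)
tau, tau_p = stats.kendalltau(vqa_scores, likert_ratings)
print(f'Spearman: {rho:.3f} (p-value: {rho_p})')
print(f'Kendall: {tau:.3f} (p-value: {tau_p})')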

Citation

If you find our project useful in your research, please cite the following paper:

@inproceedings{Cho2024DSG,
  author    = {Jaemin Cho and Yushi Hu and Jason Baldridge and Roopal Garg and Peter Anderson and Ranjay Krishna and Mohit Bansal and Jordi Pont-Tuset and Su Wang},
  title     = {Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation},
  booktitle = {ICLR},
  year      = {2024},
}


Issues

Indentation level of some code

DSG/query_utils.py

Lines 350 to 389 in 3f844c1

# 2) Run LM calls
if verbose:
    print(f"Running LM calls with {num_workers} workers.")
if num_workers == 1:
    total_output = []
    for kwargs in tqdm.tqdm(total_kwargs):
        prompt = kwargs["prompt"]
        output = generate_fn(prompt)
        total_output += [output]
else:
    from multiprocessing import Pool
    with Pool(num_workers) as p:
        total_inputs = [d['prompt'] for d in total_kwargs]
        total_output = list(
            tqdm.tqdm(p.imap(generate_fn, total_inputs), total=len(total_inputs)))

# 3) Postprocess LM outputs
id2outputs = {}
for i, id_ in enumerate(
    tqdm.tqdm(
        ids,
        dynamic_ncols=True,
        ncols=80,
        disable=not verbose,
        desc="Postprocessing LM outputs"
    )
):
    test_input = id2inputs[id_]["input"]
    raw_prediction = total_output[i]
    prediction = parse_fn(raw_prediction).strip()
    out_datum = {}
    out_datum["id"] = id_
    out_datum["input"] = test_input
    out_datum["output"] = prediction
    id2outputs[id_] = out_datum

Hello,

Thank you for sharing your great work!

Upon attempting to generate a DSG, I've noticed that the indentation of certain lines may be incorrect. It appears they should be indented one level less, as the current formatting causes an exception to be thrown. Once I adjusted the indentation level, the generation process proceeded without any issues.

Qinyu

Could you open-source the annotation system?

Hello,

I've recently delved into the Davidsonian Scene Graph and found it absolutely fascinating. I want to follow in your footsteps and continue exploring this domain.
But now I am stuck on constructing the annotation system; could you open-source the annotation system of DSG?

Thank you for your time and consideration, and once again, kudos to the team for such an outstanding job!

Best regards!
